Get the latest document version and aggregate the results - elasticsearch

My index contains a lot of documents, each of them has several versions, for example:
{ "doc_id": 13, "version": 1, "text": "bar" }
{ "doc_id": 13, "version": 2, "text": "bar" }
{ "doc_id": 13, "version": 3, "text": "bar" }
{ "doc_id": 14, "version": 1, "text": "foo" }
{ "doc_id": 14, "version": 2, "text": "bar" }
I want to get the latest version of each document and then aggregate those latest versions using a terms aggregation.
I've tried to use top_hits to retrieve the latest versions:
{"size" :0,
"aggs" : {
"doc_id_groups" : {
"terms" : {
"field" : "doc_id",
"size" : "0"
},
"aggs" : {
"docs" : {
"top_hits" : {
"size" : 1,
"sort" : {
"version" : {
"order" : "desc"
}
}
}
}
}
}
}
}
But I can't aggregate any further, because top_hits doesn't support sub-aggregations.
I guess retrieving the ids and then aggregating them would be a very heavy operation for the client.
Maybe scripting could help?
Update: one thing I forgot to mention: before aggregating, the documents are filtered by a time range, so we don't know which version is the latest at index time, only at search time.

From the provided samples and the additional details in chat, I do not think you can achieve the required results using aggregations alone. But I can propose an alternative solution instead:
Add property "current" of type Boolean which
will be set to true for all the latest versions of the documents. If
a new version is inserted - "current" will be set to false
in an older version and set to true in a newer one.
Add property "timepoints" which will contain multiple values. In the end of the day (any other period can be used) for all the
current records add the current timestamp (or any other id of the
period, e.g. "09.30.2016", or "Jan") to the "timepoints"
array.
Pros:
You can easily retrieve the current records at some point of time just checking whether the time point is in the "timepoints" array.
You can retrieve all the available time points from all the documents with a single query.
You can do the aggregation by time points, e.g. to count all the records at every point of time.
No need to maintain multiple indices, duplicates of the records, etc.; the algorithm is pretty straightforward.
Cons:
No possibility to get the current versions at an arbitrary point of time, only at the points for which the calculation was performed.
The overall size of the "timepoints" arrays may increase significantly if you run the calculation too often and you have millions of records.
Workarounds:
For more fine-grained statistics, run the calculation on an hourly basis. But once a day (or month, or year), remove some of the time points from the "timepoints" array for older periods. In the end you will have a set of time points corresponding to every year (where it was more than a year ago), every month (more than a month ago), every day (more than a day ago), and every hour for the latest period. Of course, the removal algorithm can be tuned to your needs.
If you are mostly working with the latest versions of the records, store them in a separate index and keep the older versions in another one. In this case you don't even need the "current" property: just run through all the records in your current index and add the time stamp.
I can provide you all the queries you need for the above mentioned steps in case of a need.
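As a taste of what those queries could look like, here is a minimal sketch of the read side: filter on one time point and run the terms aggregation over the documents that were current at that moment. The index name my_index is an assumption, the "09.30.2016"-style time point ids are the ones suggested above, and the terms aggregation presumes text is a not_analyzed/keyword field:
POST /my_index/_search
{
  "size": 0,
  "query": {
    "term": { "timepoints": "09.30.2016" }
  },
  "aggs": {
    "by_text": {
      "terms": { "field": "text" }
    }
  }
}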

You should look at solving this client side. I can think of two ways to approach it.
Use the scroll API to go through all the documents and find the latest version of each. Then, again client side, aggregate by text.
Use an Elasticsearch terms aggregation on doc_id with a sub-aggregation of a max aggregation on version; this will give you the latest version for each document id (see the sketch below). Then create a boolean OR terms filter that uses the doc_id and version pairs from the first part. This filtered query should then have a terms aggregation on text.
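A minimal sketch of the first request in that approach, assuming the index is called my_index:
POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "doc_id_groups": {
      "terms": { "field": "doc_id" },
      "aggs": {
        "latest_version": {
          "max": { "field": "version" }
        }
      }
    }
  }
}
Client side, you would read latest_version out of each bucket, build the OR filter from the (doc_id, version) pairs, and send the second request with the terms aggregation on text.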
Either way, you need to do some client-side work. I don't believe scripting will help. If you already know the latest version number for each document, then this becomes a lot easier.

Related

Painless script with Spring Data Elasticsearch

We are using Spring Data Elasticsearch to build a 'fan out on read' user content feed. Our first attempt currently shows content based on keyword matching and recency, using NativeSearchQueryBuilder.
We want to further improve the relevancy ordering of what is shown to the user based on additional factors (e.g. user engagement, what the user is currently working on, etc.).
Can this custom ordering be done using NativeSearchQueryBuilder, or do we get more control using a Painless script? If it's a Painless script, can we call it from Spring Data Elasticsearch?
Any examples or recommendations would be most welcome.
Elasticsearch orders its results by relevance score, which marks a result's relevancy to your search query: each document in the result set carries a number (_score) which signifies how relevant the document is to the given query.
If the data you want to base your ordering on is part of the indexed data (document fields, for example), you can use the Query DSL to boost the _score field. A few options I can think of (combined in the sketch after this list):
boost a search query depending on its criteria: a user searches for a 3-room flat, but a 4-room flat at the same price would be a much better match, so we can use: { "range": { "rooms": { "gte": 4, "boost": 1 }}}
field_value_factor: you can favor results by a field's value: more 'clicks' by users, more 'likes', etc.
random_score: if you want randomness in your results, e.g. a different result every time a user refreshes your page; you can also mix it with the existing scoring.
decay functions (Gauss!) to boost/unboost results that are close to/far from our central point. Let's say we search for apartments and our budget is set to 1700: { "gauss": { "price": { "origin": "1700", "scale": "300" } } } will score by how close we are to our budget of 1,700. Any flat with a much higher price (say 2,300) gets penalized much more by the Gauss function, as it is far from our origin; the decay behavior of the Gauss function separates our results according to their distance from the origin.
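A minimal sketch combining two of these options in one function_score query; the index name content, the title match, and the likes field are assumptions, while the gauss clause is the one from the example above:
POST /content/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "flat" } },
      "functions": [
        { "field_value_factor": { "field": "likes", "modifier": "log1p", "missing": 0 } },
        { "gauss": { "price": { "origin": "1700", "scale": "300" } } }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}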
I don't think this has any abstraction in spring-data-elasticsearch, so I would use a FunctionScoreQueryBuilder together with the NativeSearchQueryBuilder.

ElasticSearch Index Modeling

I am new to Elasticsearch (you will figure that out after reading the question!) and I need help designing an Elasticsearch index for a dataset similar to the one described in the example below.
I have data for companies in the Russell 2000 Index. To define an index for these companies, I have the following mapping:
{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "text" },
        "name": { "type": "text" },
        "CEO": { "type": "text" },
        "CEO_start_date": { "type": "date" },
        "CEO_end_date": { "type": "date" }
      }
    }
  }
}
As the CEO of a company changes, I want to update the end date in the existing document and add a new document with the new start date.
Here,
(1) For such a dataset, what is an ideal id scheme? Since I want to keep multiple documents per company, should I consider a (company_id + date) combination as the id?
(2) Since CEO changes are infrequent, should time-based indexing be considered in this case?
Your schema is a reasonable starting point, but I would make a few minor changes and comments:
Recommendation 1:
First, in your proposed schema you probably want to change ticker to be of type keyword instead of text. keyword allows you to use terms queries to do an exact match on the field.
The text type should be used when you want to match against analyzed text. Analyzing text applies normalizations to your text data to make it easier to match something a user types into a search bar. For example, common words like "the" will be dropped and word endings like "ing" will be removed. Depending on how you want to search for names in your index, you may also want to switch that field to keyword. Also note that you have the option of indexing a field twice using BOTH keyword and text if you need to support both search methods (see the sketch below).
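A sketch of the adjusted mapping: ticker becomes keyword, and name is indexed both ways via a multi-field (the raw sub-field name is an arbitrary choice):
{
  "mappings": {
    "company": {
      "_all": { "enabled": false },
      "properties": {
        "ticker": { "type": "keyword" },
        "name": {
          "type": "text",
          "fields": { "raw": { "type": "keyword" } }
        },
        "CEO": { "type": "text" },
        "CEO_start_date": { "type": "date" },
        "CEO_end_date": { "type": "date" }
      }
    }
  }
}
With this, exact terms queries go against ticker or name.raw, while full-text matches still use name.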
Recommendation 2:
Sid raised a good point in his comment about using this as a primary store. I have used ES as a primary store in a number of use cases with a lot of success. I think the trade-off you generally make by selecting ES over something more traditional like an RDBMS is that you get way more powerful read operations (searching by any field, full-text search, etc.) but lose relational operations (joins). Also, I find that loading/updating data in ES is slower than in an RDBMS due to all the extra processing that has to happen. So if you are going to use the system primarily for updating and tracking the state of operations, or if you rely heavily on JOIN operations, you may want to look at using an RDBMS instead of ES.
As for your questions:
Question 1: ID field
You should check whether you really need to create an explicit ID field. If you do not create one, ES will create one for you that is guaranteed to be unique and evenly distributed. Sometimes you will still need to supply your own IDs, though. If that is the case for your use case, then adding a new field where you combine the company ID and date would probably work fine (see the sketch below).
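For example, indexing a CEO record under such a combined id could look like this (the index name and all sample values are made up):
PUT /companies/company/ACME-2016-09-30
{
  "ticker": "ACME",
  "name": "Acme Corp",
  "CEO": "Jane Doe",
  "CEO_start_date": "2016-09-30"
}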
Question 2: Time based index
Time-based indices are useful when you are going to have lots of events. They make it easy to do maintenance operations like deleting all records older than X days. If you are just indexing CEO changes for 2000 companies, you probably won't have very many events. I would probably skip them, since they add a bit of complexity that doesn't buy you much in this use case.
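For context, the maintenance win of time-based indices is that expiring a whole period is a single cheap call (the index name here is hypothetical):
DELETE /ceo-changes-2015.01
Without time-based indices, the same cleanup requires a much more expensive delete-by-query over the whole index.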

Elasticsearch aggregation on latest documents

I have a document which can be modified any number of times a day.
I've ordered these documents in a time series, creating an index for each day.
Each day therefore has multiple versions of the same document with different modified dates.
Document sample:
{
  "id": 1234,
  "user": "kc",
  "subscriptions": [
    "paper1",
    "paper2"
  ],
  "modified_date": 1466697434020
}
What I'm looking for is to get the latest document in a particular time range for every user,
and to apply aggregations on top of that set.
That would give a result like: in the last week/month, how many people are subscribed to each of the papers.
Using top_hits I was able to get the latest document for each user in a time range, but I cannot apply further aggregations on this set of data.
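Reconstructed as a sketch, that top_hits query would look roughly like this (the user and modified_date fields come from the sample; the index pattern and the time range are assumptions). The problem is that each bucket then holds only a raw hit, and top_hits accepts no sub-aggregations, so the subscriptions of those latest documents cannot be aggregated further:
POST /index-*/_search
{
  "size": 0,
  "query": {
    "range": { "modified_date": { "gte": "now-7d" } }
  },
  "aggs": {
    "by_user": {
      "terms": { "field": "user" },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [ { "modified_date": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}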

elasticsearch scoring on multiple indexes

I have an index for each quarter of a year ("index-2015.1", "index-2015.2", ...).
I have around 30 million documents in each index.
A document has a text field ('title').
My document sorting method is (1) _score, (2) created date.
The problem is:
when searching for some text in the 'title' field across all indexes ("index-201*"), the first results always come from one index.
Let's say I am searching for 'title=home' and I have 10k documents in "index-2015.1" with title=home and 10k documents in "index-2015.2" with title=home; then the first results are all documents from "index-2015.1" (and not from "index-2015.2", or mixed), even though "index-2015.2" contains documents with a higher "created date" than "index-2015.1".
Is there a reason for this?
The reason is probably that the scores are specific to the index. So if you really have multiple indices, the resulting scores of the documents will be calculated (slightly) differently for each index.
Simply put: among other things, the score of a matching document depends on the query terms and their occurrences in the index. The score is calculated with regard to the index (actually, by default, even with regard to each separate shard). There are some normalizations Elasticsearch does, but I don't know their details.
I'm not really able to explain it well, but here's an article about scoring. I think you want to read at least the part about TF/IDF, which should explain why you get different scores.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
EDIT:
So, after testing it a bit on my machine, it seems possible to use another search_type to achieve scoring suitable for your case.
POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": "home"
    }
  }
}
The important part is search_type=dfs_query_then_fetch. If you are programming in Java or something similar, there should be a way to specify it in the request. For details about the search types, refer to the documentation.
Basically, it will first collect the term frequencies from all affected shards (and indexes), so the score is computed against those global statistics.
According to Andrei Stefan and Slomo, index boosting solves my problem:
body={
  "indices_boost": {
    "index-2015.4": 1.4,
    "index-2015.3": 1.3,
    "index-2015.2": 1.2,
    "index-2015.1": 1.1
  }
}
EDIT:
Using search_type=dfs_query_then_fetch (as Slomo described) solves the problem in a better way (depending on your business model...).

Drawing "opened count" over time given open-event and close-event documents

I have documents modeling the creation of a ticket, such as:
{
  "number": 12,
  "created_at": "2015-07-01T12:16:17Z",
  "closed_at": null,
  "state": "open"
}
At some point in the future, a second document models the closing event:
{
  "number": 12,
  "created_at": "2015-07-01T12:16:17Z",
  "closed_at": "2015-07-08T08:12:42Z",
  "state": "closed"
}
Problem: I want to draw the history of open tickets over time. In the example above, I'd like ticket number 12 to contribute to the count over the whole 2015-07-01 to 2015-07-08 timespan. What I tried:
Bucketing with date_histogram only seems to give the number of tickets created or closed in any given date bucket (see the sketch after this list).
Scripted metrics only seem to allow me to change the metric computation, not the bucketing of a document.
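For reference, a sketch of that date_histogram attempt (the index name tickets is assumed): each ticket falls only into the bucket of its creation date, not into every day it remained open, which is exactly the limitation described above:
POST /tickets/_search
{
  "size": 0,
  "aggs": {
    "opened_per_day": {
      "date_histogram": {
        "field": "created_at",
        "interval": "day"
      }
    }
  }
}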
This is my very first day playing with Elasticsearch and Kibana, so I might be missing something obvious. In particular, I cannot tell whether buckets act as partitions (i.e. whether a document can only be in a single bucket), and hence whether my problem can only be solved by creating additional documents for each datapoint I want to appear on the graph.
Additional note: I have control over the feeding process and the schema if storing additional data can help, but I'd like to avoid doing so if possible.
Though that's not a big deal: either maintain hashing on the basis of dates, or keep created_at as a grouping key for documents made on a given day, so that you can distinguish and query them as you want.
