Elastic Search Number of Document Views - algorithm

I have a web app that is used to search and view documents in Elastic Search.
The goal now is to maintain two values.
1. How many times the document was fetched in total (life time views)
2. How many times the document was fetched in last 30 days.
Achieving the first is somewhat possible, but the second one seems to be a very hard problem.
The two values need to be part of the document as they will be used for sorting the results.
What is the best way to achieve this?

To maintain expiring data like that, you need to store each view with its timestamp. You could store the views in an array in the ES document, but that is asking for trouble: the update operation you would call on every view has to reindex the whole document (ES marks the old version as deleted and writes a new one; that's how ES does updates). If two views happen at the same time, it is also difficult to make sure they both get stored, because concurrent updates to the same document can conflict.
There are two ways to store the views and make use of them in the query:
1. Put them in a separate store (could be a different index in ES if you like), and run a cron job or similar every day to update every item in the main index with the number of views from the last thirty days in the view store. Even with a lot of data it should be possible to make this quite efficient, depending on your choice of store for views.
2. Use the ElasticSearch parent/child datatype to store views in the same index as the main documents, as children. I'm not sure that I'd particularly recommend this approach, but I think it should be possible with aggregations to write a query that sorts primary documents by the number of children (filtered by date). It might be quite slow though.
I doubt there is any other way to do this with current versions of ES, because it doesn't support joining across indices. Either the data must be aggregated in advance onto the document, or it has to be available in the same index.
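The first option (a separate view store plus a daily job) can be sketched in plain Python as follows. This is only an illustration under assumptions: the view store is modelled as a list of (doc_id, timestamp) pairs, and the field names views_total and views_30d are hypothetical. The {"doc": ...} wrapper is the body shape the ES update API accepts for partial updates; how the bodies are sent (for example through the bulk API) is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def summarize_views(view_log, now, window_days=30):
    """Return {doc_id: (lifetime_views, recent_views)} from (doc_id, timestamp) pairs."""
    cutoff = now - timedelta(days=window_days)
    counts = defaultdict(lambda: [0, 0])  # doc_id -> [lifetime, recent]
    for doc_id, viewed_at in view_log:
        counts[doc_id][0] += 1            # every stored view counts toward the lifetime total
        if viewed_at >= cutoff:
            counts[doc_id][1] += 1        # only views inside the rolling window
    return {doc_id: (total, recent) for doc_id, (total, recent) in counts.items()}


def to_partial_updates(summary):
    """Shape the counts as partial-update bodies (field names are hypothetical)."""
    return {
        doc_id: {"doc": {"views_total": total, "views_30d": recent}}
        for doc_id, (total, recent) in summary.items()
    }
```

Because the 30-day count is recomputed from timestamps on every run, old views drop out of the window automatically and nothing has to be decremented.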

Related

Reasons & Consequences of putting a Date in Elastic Index Name

I am looking at sending my app logs to Elastic (6.x) via FileBeat and Logstash. As mentioned in Configure the Logstash output and recommended elsewhere, it seems that I need to add the date to the index name. The stated reason is that when the time comes to delete old data, it is easier to delete an entire index by date than to delete individual documents. Is this true?
If I should follow this recommendation of adding the date to the index name, what additional things do I need to do to ensure seamless querying? By this I mean querying, especially in Kibana, over for example the past day, which would need to look at today's index as well as yesterday's index.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Edit: Kamal raised a good point that I have not provided any information about my cluster and my needs. The following is what I'm working with:
What is your daily data creation/expected count
I'm not sure. I don't expect more than a GB of data a day, or more than a couple hundred thousand documents a day. Since these are logs, I don't expect any updates to the documents once they are created.
Growth rate of the data in the future (1 year - 5 years)
At the moment, I don't see the growth rate crossing a GB a day.
How many teams are using the same cluster apart from yours if there is any
The cluster would be used (actually queried) by just my team. There are about 5 of us right now, and I don't see more than 10 users (and that's not concurrent, just over a day or month).
Usage patterns, type of queries used etc.
I'm not sure, but there certainly would not be updates to the data other than deletions.
Hardware details
I've not worked this out with management. For the most part I expect 3 nodes. Also, this is not critical, i.e. if we lose all of our logs for some reason, I would not lose sleep over it.
First of all, you need to take a step back and decide whether you really need multiple indexes or a single one (where you filter documents at query time using a date field).
Some questions you should answer before making that decision:
What is your daily data creation/expected count
Growth rate of the data in the future (1 year - 5 years)
How many teams are using the same cluster apart from yours if there is any
Usage patterns, type of queries used etc.
Hardware details
Advantages
Having multiple indexes (with the date in the index name) is beneficial in several ways:
You can delete the old indexes without affecting new ones.
If you have to change the mapping, you can do so on the new index without affecting the old ones. That is comparatively little overhead; with a single index you would have to reindex all the documents, which takes a lot more time if the index is large. And if this happens every now and then, you would have to schedule such operations for times of minimal usage, which can harm productivity.
Searching across multiple indexes is still convenient.
I'm not really sure, but scaling should be easier with multiple indexes.
Disadvantages are:
Additional shards are created for each and every index, which can waste some storage space.
Overhead for the monitoring/operations team to maintain multiple indexes.
At times it can lead to over-creation of indexes.
If you expect no mapping changes and only a small number of inserted documents (hundreds, or a few hundred), it would be better to use a single index.
The only reliable way to figure out what's best is to set up a cluster that closely resembles the production one, with data that resembles production data, try various configurations, and see which solution fits best.
Speaking of querying in Kibana, is there a way of simply working with the base index name without the date stamp i.e. setting it up so that I do not see or have to deal with the date named indexes?
Yes, there is. If you have indexes with names like logs-0001, logs-0002, you can use logs-* as the index name when you query.
Including the date in an index name is a very common use case implemented by many Elasticsearch users. It helps with archiving/purging old indices, as you mentioned. You don't need to do anything additional to be able to query. Set up your index base name as an index pattern for your indices, e.g. logstash-*, and you can query on that index pattern in Kibana.
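To make the naming and purging idea concrete, here is a small Python sketch. It assumes daily indices named base-YYYY.MM.DD (the shape of the logstash-* names mentioned above); the helper names index_for and expired_indices are made up for this example, and the actual deletion call against the cluster is left out.

```python
from datetime import date, timedelta
from fnmatch import fnmatch


def index_for(day, base="logstash"):
    """Name of the daily index for a given date, e.g. logstash-2024.03.31."""
    return f"{base}-{day:%Y.%m.%d}"


def expired_indices(index_names, today, keep_days, base="logstash"):
    """Indices matching base-* that are older than the retention window."""
    cutoff = index_for(today - timedelta(days=keep_days), base)
    # Zero-padded YYYY.MM.DD suffixes sort chronologically as plain strings.
    return sorted(
        name for name in index_names
        if fnmatch(name, f"{base}-*") and name < cutoff
    )
```

Each name returned by expired_indices can be dropped as a whole index, which is the cheap deletion the question asks about, while queries keep using the single pattern logstash-*.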

ElasticSearch Frequent Updates

We have a rather difficult set of requirements for our search engine replacement and they go as follows.
Every instance will have a unique schema; we have multiple client installations that we don't control, with varying data structures
Frequent updates; it's not uncommon for a single action to update a field on every record. Some fields are updated frequently, others are never changed
Some of our fields can be very large (50 MB+), though these are never changed and are rare in a data set.
We'd like to have near real-time search if possible
We're looking at making the fields that are updated semi-frequently/frequently into child documents. The issue with this is that we have a set of tags on the record that change quite frequently and that we want to search against in near real time. There is a strong expectation in our application that when this data is modified, searches immediately reflect the change. We've tried child documents, but they don't seem to update as quickly as we'd like over a large data set.
So the questions are as follows:
Are there strategies I'm not aware of for updating child documents quickly? Maybe a plugin? Right now we're only using the RESTful interface
Would it be better to store the data that isn't frequently changed in ES but keep the tags in a database? Possibly creating a plugin in ES that maps the two together? Would this plugin be difficult to write? Ideally, we'd be able to mix our searches together (tags + regular ES queries) in a boolean fashion, including the tags stored in a table.
Hopefully this will be helpful to other people in this situation. Here is the solution I came up with.
Use Child/Parent documents
Create a single parent that contains the static information for the record that rarely/never changes (the bulk of the data indexed)
Create child documents for the other data I wanted to index, so they could be indexed independently of the primary document
Since I had split the record data into static and non-static documents, and then broken the non-static data into further child documents, I was able to create a high-throughput indexer. The total set of records to be indexed was split into sub-chunks, which were then further split by child document type. I distributed these chunks to various indexer instances, so the number of documents indexed per second was limited only by the throughput of the data source or the ES cluster.
This was all done through the bulk API. Keeping the static data away from the frequently changing data allowed the frequently changed data to be updated quite quickly, and this speed was only limited by the available hardware. It was a little tougher to craft queries using the child document clauses and aggregates, but everything seemed to work.
Notes
There is a performance penalty to using parent/child documents which was a non issue for us considering what ES gave us over our previous solution but it may cause issues for other implementations.
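The chunking step described above could look roughly like this in Python. It is a sketch, not the indexer from the answer: chunked and bulk_body are hypothetical helpers, the index name is whatever the caller passes in, and the parent/routing metadata that child documents need is omitted because its form depends on the ES version. The body format itself (one action line followed by one source line, newline-delimited JSON with a trailing newline) is what the bulk API expects.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_body(index, docs):
    """Build a newline-delimited JSON body for the bulk API.

    `docs` is a list of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline
```

Each chunk's body can be handed to a separate indexer instance, so small, frequently changing documents (such as tags) are reindexed without touching the large static parent.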

How to make the search most efficient?

For a property sale/rent website, a search function should be provided. At the same time, users should be able to use filters to narrow the results to what they want most.
Normally, a property has many attributes, like the price, address, year built, area, and many amenities such as a balcony, washing machine and so on. There may be over 100 of them.
So how should the database (MySQL or a NoSQL store) and the architecture be designed to make search as efficient as possible?
Sounds like your application requires a lot more search queries than update queries, and that the search queries are quite diverse.
In this case, try ElasticSearch: You choose some database where you store and modify your data. Then, you should propagate any update to an ElasticSearch index, where you upload a denormalized view of the data, which is closer to what users will expect to get when searching.
https://www.quora.com/Whats-the-best-way-to-setup-MySQL-to-Elasticsearch-replication
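As a rough illustration of what a "denormalized view" might mean here, the sketch below flattens a property row and its amenity rows into one search document. The table layout, the field names and the helper build_search_document are all assumptions made for the example; the real shape depends on your schema and on which filters the site offers.

```python
def build_search_document(property_row, amenity_names):
    """Merge a property row and its amenities into one flat document.

    Both inputs are assumed to come from separate tables in the primary
    database; the result is what would be uploaded to the search index.
    """
    doc = dict(property_row)                      # copy, so the source row is untouched
    doc["amenities"] = sorted(set(amenity_names)) # de-duplicated, stable order
    return doc
```

With all filterable attributes on one document, a search with many filters does not need joins at query time.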

Elasticsearch separating out data into indexes

I have three different data sources which get updated at separate times each day. My first idea was to combine all the data into a single index but I'm wondering if it's more sensible to keep each data source in their own index. That way when a data source gets updated, I can just refresh one index.
When it comes to searching, I'll just search all the indexes. Is this a sensible approach, or will separating the data out introduce a lot of overhead?
James
If it makes sense to merge the indices you can do so, but if you want the flexibility of refreshing only one source - you should keep them separated.
I'm not sure if you're aware of aliases: you can define an alias that will include all the three indices - so that from a "user" perspective you don't have to search "all the indices" - it'll be transparent to the user that it's actually not a single index.
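For reference, an alias over several indices is defined by sending a list of "add" actions to the _aliases endpoint. The Python sketch below only builds that request body; the index and alias names in the usage line are placeholders, and alias_actions is a helper name invented for this example.

```python
def alias_actions(alias, indices):
    """Request body for POST /_aliases that points one alias at several indices."""
    return {
        "actions": [
            {"add": {"index": name, "alias": alias}}
            for name in indices
        ]
    }


# Placeholder names: three per-source indices behind one alias.
body = alias_actions("all_sources", ["source_a", "source_b", "source_c"])
```

Searches then target all_sources, while each source index can still be rebuilt or refreshed on its own schedule.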

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when indexing the document is likely to be expensive, given the number of indexed fields and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like votes/views should only be updated periodically in ES, while more critical fields like answers/questions are pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. Of course, that method will require more memory/disk space, and it will take up more CPU/time when you query for totals. An alternative would be to have the parent store the searchable fields and let the child hold the metadata (where the child's fields are not analyzed). This will allow you to make frequent updates without having to undergo an expensive re-index, since there is no text to analyze.
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
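One way to read the "de-duping updates" suggestion is to coalesce counter increments in memory and write them out in batches. The class below is a minimal sketch of that idea; ViewBuffer is a hypothetical name, it is not thread-safe, and the actual write to ES (for example one bulk request per flush) is left out.

```python
from collections import Counter


class ViewBuffer:
    """Coalesce per-document view increments between flushes."""

    def __init__(self):
        self._pending = Counter()

    def record(self, doc_id, count=1):
        """Note `count` new views for a document without touching the index."""
        self._pending[doc_id] += count

    def flush(self):
        """Return the accumulated increments and start a fresh batch."""
        pending, self._pending = self._pending, Counter()
        return dict(pending)
```

A thousand views of one question between two flushes then cost a single update instead of a thousand reindex operations, at the price of counts that lag by up to one flush interval.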
I think the best way to handle the change is to split the document (you can use a parent/child relationship, or just store the parent id) and make each document as small as possible, moving the changeable parts to new types.
For SO, this could be one way to accomplish your requirement.
You can use multiple types for this; consider this post (view and vote counts).
Create a type each for post, view and vote.
For a post, index a document into the post type (post id, title, description, tags). For every view of that post, index a document into the view type (with the id of the post). When the post is voted on, index a document into the vote type (number of votes, id of the post, and any other info you need, such as a positive or negative flag).
So, to get the views for a post, filter by post id and count the documents in the view type.
To get the number of votes, use a stats aggregation on the number of votes, or a terms aggregation followed by a stats aggregation to get positive and negative votes separately.
This is the way I think is best, but there can be other opinions too.
Thanks
What I do is use a database like Mongo or MySQL for storing properties that get updated frequently, and use Elasticsearch to store documents for text searching.
Example: I want to keep data about a book and its contents, and I also want to keep the total number of views. Updating and reindexing the document each time a user views it is total overkill.

Resources