Ways to only process new (indexed after last run) data in Elasticsearch?

Is there a way to get the date and time that an Elasticsearch document was written?
I am running ES queries via Spark and would prefer NOT to look through all documents that I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now.
What is the best, most efficient way to do this?
I have looked at:
updating each document to add a field with an array of booleans indicating whether it has been looked at by each analytic. The negative is waiting for the update to occur.
an index-per-time-frame approach, which would break the current indexes down into smaller ones, e.g. one per hour. The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6

I posted the question on the Elasticsearch discussion board and it appears that using an ingest pipeline is the best option.
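For example, a pipeline with a set processor can stamp every document with the time the ingest node processed it. A minimal sketch in console syntax (the pipeline and field names are illustrative, not from the discussion thread):

    PUT _ingest/pipeline/add-ingest-time
    {
      "description": "Stamp each document with the time it was ingested",
      "processors": [
        {
          "set": {
            "field": "ingest_time",
            "value": "{{_ingest.timestamp}}"
          }
        }
      ]
    }

Documents indexed with ?pipeline=add-ingest-time then carry an ingest_time field that later queries can filter on.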

A workaround could be:
While inserting data into Elasticsearch using Logstash, Logstash appends an @timestamp field to each document, which represents the time (in UTC) at which the document was created; alternatively, we can use an ingest pipeline.
After that we can query based on the timestamp.
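For example, a sketch of such a range query, assuming your Spark job records when it last ran (the index pattern and timestamps are illustrative):

    GET logstash-*/_search
    {
      "query": {
        "range": {
          "@timestamp": {
            "gt": "2017-09-01T00:00:00Z",
            "lte": "now"
          }
        }
      }
    }

Only documents ingested after the stored last-run time are returned.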
For more on this, please have a look at: Mapping changes.
Note that there is no way to ask ES to insert a timestamp at index time by itself.

Elasticsearch doesn't have such functionality built in.
You need to manually save a date with each document; then you will be able to search by date range.

Related

Get last document from index in Elasticsearch

I'm playing around with the package github.com/olivere/elastic; all works fine, but I have a question: is it possible to get the last N inserted documents?
The From statement defaults to 0 as the starting point for the Search action, and I didn't understand whether it is possible to omit it in a search.
TL;DR:
I am not aware of a feature in the Elasticsearch API to retrieve the latest inserted documents, but there is a way to achieve something similar if you store the ingest time of the documents: you can then sort on the ingest time and retrieve the top N documents.
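For example, assuming each document carries an ingest_time field you populate yourself (as described above), a sketch of a query for the last 10 inserted documents:

    GET my-index/_search
    {
      "size": 10,
      "sort": [
        { "ingest_time": { "order": "desc" } }
      ]
    }

With github.com/olivere/elastic, the equivalent should be chaining Sort("ingest_time", false) and Size(10) on the search service.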

Remove documents with Logstash

I work on several log files that I process with Logstash. I divide them into several documents (multiline) and then I extract the information I want.
The problem is that I end up with several documents that contain nothing interesting, and they take up space.
Do you know of a way to delete documents where no information was extracted by Logstash?
Thank you very much for your help!
In older versions of Elasticsearch, when creating indexes you could specify a ttl field that indicates the expiry of a document in the index. You could set the ttl to a value of, say, 24 hours. Read more here.
However, ttl has been deprecated as of version 2.0 since it's a clumsy way of removing stale data. Personally, I create rolling indexes with Logstash and have a cron job that simply drops the daily index at end of day via curl.
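For example, dropping a day's index is a single call (the index name is illustrative):

    DELETE /logstash-2017.09.01

From cron, the same request is a one-line curl -XDELETE against the cluster.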
Also refer to this article from ES
https://www.elastic.co/guide/en/elasticsearch/guide/current/retiring-data.html

Can Beats update existing documents in Elasticsearch?

Consider the following use case:
I want the information from one particular log line to be indexed into Elasticsearch, as a document X.
I want the information from some log line further down the log file to be indexed into the same document X (not overwriting the original, just adding more data).
The first part, I can obviously achieve with filebeat.
For the second, does anyone have any idea about how to approach it? Could I still use filebeat + some pipeline on an ingest node for example?
Clearly, I can use the ES API to update said document, but I was looking for a solution that doesn't require changes to my application; rather, one where it is all possible to achieve using the log files.
Thanks in advance!
No, this is not something that Beats were intended to accomplish. Enrichment like you describe is one of the things that Logstash can help with.
Logstash has an Elasticsearch input that would allow you to retrieve data from ES and use it in the pipeline for enrichment. And the Elasticsearch output supports upsert operations (update if exists, insert new if not). Using both those features you can enrich and update documents as new data comes in.
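Under the hood, that upsert behaviour corresponds to Elasticsearch's update API with doc_as_upsert. A sketch in ES 5.x console syntax (index, type, id, and field names are illustrative):

    POST my-logs/log/X/_update
    {
      "doc": {
        "extra_field": "data from the later log line"
      },
      "doc_as_upsert": true
    }

If document X already exists, the fields under doc are merged into it; if it doesn't, the doc body is indexed as a new document.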
You might want to consider ingesting the log lines as-is into Elasticsearch, and then using Logstash to build a separate index that is entity-specific and driven by data from the logs.

Getting elasticsearch to utilize Bro timestamps through Logstash

I'm having some issues getting Elasticsearch to interpret an epoch-millis timestamp field. I have some old Bro logs I want to ingest and have them be in the proper order and spacing. Thanks to Logstash filter to convert "$epoch.$microsec" to "$epoch_millis",
I've been able to convert the field holding the Bro timestamp to the proper number of digits. I've also inserted a mapping into Elasticsearch for that field, and it says the type is "date" with the format being the default. However, when I go and look at the entries, each one still has a little "t" next to it instead of a little clock, and hence I can't use the field as my filter view reference in Kibana.
Does anyone have any thoughts or have dealt with this before? Unfortunately it's a standalone system, so I would have to manually enter any of the configs I'm using.
I did try converting my field "ts" back to an integer after using the method described in the link above, so it should be a Logstash integer before hitting the Elasticsearch mapping.
So I ended up just deleting all my mappings in Kibana and Elasticsearch, then resubmitted, and this time it worked. Must have been some old junk in there that was messing me up. But now it's working great!
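For reference, a date mapping that accepts epoch-millisecond values looks like this in ES 5.x (index, type, and field names are illustrative; note that the default date format already includes epoch_millis):

    PUT bro-logs
    {
      "mappings": {
        "log": {
          "properties": {
            "ts": {
              "type": "date",
              "format": "epoch_millis"
            }
          }
        }
      }
    }

The mapping has to be in place before the field is first indexed; otherwise the original mapping wins until the index is recreated, which matches the fix described above.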

What is the advantage of multiple index in elasticsearch?

An example of this is the Logstash convention: it names its Elasticsearch indexes [logstash-]YYYY.MM.DD, so a new index is used each day, and Elasticsearch itself is then queried by Kibana. Is there any reason why this is done? What is the advantage?
Advantages that come to mind:
If you're looking for Tuesday's data, you can just look in Tuesday's index.
You can delete old data more easily.
If you want to modify the mappings, you can update the index template and the changes will take effect the next day; it's your choice whether to reindex the old data or not (see the template sketch below).
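To illustrate the last point, a sketch of an ES 5.x index template (the body is illustrative); every new daily index matching logstash-* picks it up at creation time:

    PUT _template/logstash
    {
      "template": "logstash-*",
      "mappings": {
        "log": {
          "properties": {
            "@timestamp": { "type": "date" }
          }
        }
      }
    }

Editing the template changes tomorrow's index without touching today's.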
