Does anyone have experience with entity-centric indexing in Elasticsearch, using Python and Groovy scripts to reindex an event-centric index into entity-centric indexes, where every log message gets its own index (or something along those lines)?
I've got a lot of messages like the following:
Jul 23 09:24:16 msda msda-core[5147]: 1563866656876839.mt
Jul 23 09:24:18 msda msda-core[5210]: 1563866656876839.0.dn
where I have a lot of the same id numbers with a .mt suffix and a .dn suffix.
For each .mt message, I always need to find the matching message with the same id number and the .dn suffix, if it appears within one hour.
Any idea would be appreciated!
If you are running v7.2 or later, I would recommend using Elasticsearch data frame transforms to create an entity-centric index grouped by the id numbers: https://www.elastic.co/guide/en/elastic-stack-overview/current/ml-dataframes.html
Use a scripted metric aggregation on the min and max timestamps to calculate the duration. There is a good example in the Elastic documentation: https://www.elastic.co/guide/en/elastic-stack-overview/7.3/example-clientips.html
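For illustration, here is a minimal sketch of such a transform using Python and requests. The source index msda-logs, the destination msda-entities, and the field names id and @timestamp are assumptions on my part, not taken from the thread; on 7.2/7.3 the endpoint is _data_frame/transforms (newer versions use _transform):

    import requests

    # Sketch only: index and field names are placeholders.
    transform = {
        "source": {"index": "msda-logs"},
        "dest": {"index": "msda-entities"},
        "pivot": {
            "group_by": {
                "id": {"terms": {"field": "id"}}
            },
            "aggregations": {
                "first_seen": {"min": {"field": "@timestamp"}},
                "last_seen": {"max": {"field": "@timestamp"}},
                # Duration between the first and last event for each id, in milliseconds
                "duration_ms": {
                    "scripted_metric": {
                        "init_script": "state.min = Long.MAX_VALUE; state.max = Long.MIN_VALUE;",
                        "map_script": "long t = doc['@timestamp'].value.toInstant().toEpochMilli(); state.min = Math.min(state.min, t); state.max = Math.max(state.max, t);",
                        "combine_script": "return state;",
                        "reduce_script": "long mn = Long.MAX_VALUE; long mx = Long.MIN_VALUE; for (s in states) { mn = Math.min(mn, s.min); mx = Math.max(mx, s.max); } return mx - mn;"
                    }
                }
            }
        }
    }

    # 7.2/7.3 endpoint; on newer versions use /_transform/msda-duration instead
    resp = requests.put("http://localhost:9200/_data_frame/transforms/msda-duration", json=transform)
    print(resp.json())

Once the transform is created and started, you can query the entity-centric msda-entities index for ids whose duration stays under one hour.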
I have an ElasticSearch instance and it does one type of search - it takes a few parameters and returns the companies in its index that match the parameters given.
I'd like to be able to pull some stats that essentially says "This company has been returned from search queries X number of times in the past week".
Does ElasticSearch store metadata that will allow me to pull this kind of info from it? If this kind of data isn't stored in ES out of the box, is there a way to enable it?
Elasticsearch (not ElasticSearch ;) ) does not do this natively, no. You can build something using the slow log, where you set the threshold to 0 so that it logs every query, but that can get noisy fast and may not be that useful.
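As a rough sketch of the slow-log idea, using Python and requests against a placeholder index called my-companies-index (the name is an assumption), you can drop the search slow log thresholds to zero via the index settings API:

    import requests

    # Sketch: "my-companies-index" is a placeholder index name.
    # A 0s warn threshold makes every query and fetch phase get logged,
    # which is exactly what makes this approach noisy.
    settings = {
        "index.search.slowlog.threshold.query.warn": "0s",
        "index.search.slowlog.threshold.fetch.warn": "0s",
    }

    resp = requests.put("http://localhost:9200/my-companies-index/_settings", json=settings)
    print(resp.json())

You would then have to parse the slow log yourself, and it records the queries rather than the documents returned, which is part of why it may not be that useful here.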
things like https://www.elastic.co/enterprise-search, built on top of Elasticsearch, do provide this sort of insight
I'm looking to set up my index so that it is partitioned into daily sub-indices whose individual settings I can adjust depending on the age of that index, e.g. indices >= 30 days old should be moved to slower hardware. I am aware I can do this with a lifecycle policy.
What I'm unable to join the dots on is how to set up the original index to be partitioned by day. When adding data or querying, do I need to specify the individual daily indices, or is there something in Elasticsearch that will do this for me? If the latter, how does it work for adding versus querying (assuming they are different)? How does it determine which partitions are relevant for a query, or which partition a document should be added to? (I'm assuming there is a timestamp field, but I can't see from the docs how it's all linked together.)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
There's no such thing as sub-indices or partitions in Elasticsearch. If you want to use ILM, which you should, then you work with aliases and multiple indices.
You will need to upgrade from 7.7, which is EOL, and use the default distribution to get access to ILM as well.
Getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it, but to answer them directly:
The major assumption with ILM is that the data being ingested is current, so at a rough level, data from today will end up in an index created today.
If you are indexing historic data then you may want to put it into "traditional" index names, e.g. logs-2021.08.09, and then attach those indices to the ILM policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
When querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. It does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html
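To make that concrete, here is a minimal ILM policy sketch in Python; the policy name, the rollover limits, and the box_type node attribute are assumptions for illustration, not anything from this thread. It rolls an index over daily and, 30 days after rollover, reallocates it to nodes tagged as slower hardware:

    import requests

    # Sketch: "logs-policy" and the "box_type" node attribute are placeholders.
    # The slower-hardware nodes would need node.attr.box_type: warm in elasticsearch.yml.
    policy = {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_age": "1d", "max_size": "50gb"}
                    }
                },
                "warm": {
                    "min_age": "30d",
                    "actions": {
                        "allocate": {"require": {"box_type": "warm"}}
                    }
                }
            }
        }
    }

    resp = requests.put("http://localhost:9200/_ilm/policy/logs-policy", json=policy)
    print(resp.json())

The policy is then attached to new indices through an index template that sets index.lifecycle.name and index.lifecycle.rollover_alias, and you write and query through the alias rather than the individual daily indices.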
I am working with Johns Hopkins University CSSE COVID19 data, published on their GitHub. Some of the metrics they publish in their daily US reports are sum aggregations. I would like to perform basic math against the values within a given field so that I can get a daily tally.
JHU publishes their data daily, so let's assume that the numbers reported reflect a 24-hour period.
Example: In the State of New York, I can see the following values for Last_Update and Recovered, where Recovered is a rolling sum of all cases where people have recovered from infection:
Last_Update,Recovered
2020-08-05,73326
2020-08-04,73279
2020-08-03,73222
2020-08-02,73134
2020-08-01,73055
2020-07-31,72973
Ideally, I would like to create a new field (be it a scripted field, or a new field that is generated via a Logstash Filter processor) called RecoveredToday, where the field value reflects the difference between today's Recovered aggregation and yesterday's Recovered aggregation.
Last_Update,Recovered,RecoveredToday
2020-08-05,73326,47
2020-08-04,73279,57
2020-08-03,73222,88
2020-08-02,73134,79
2020-08-01,73055,82
2020-07-31,72973,...
In the above data set, RecoveredToday is calculated from the value of Recovered on 2020-08-05 minus the value of Recovered on 2020-08-04.
73326 - 73279 = 47
With respect to using a Scripted Field in Kibana, according to this blog article, Scripted Fields can only analyze fields within one given document at a time, and cannot perform calculations against a field across multiple documents.
I do see user #agung-darmanto solved a similar problem on StackOverflow, but the solution calls out specific dates rather than performing rolling calculations. It's also unclear from the code snippet if the results are being inserted into a new field that can subsequently be used to build visualizations.
The approach of using Logstash Ruby processing on the fly also presents a problem. Logstash, as far as I know, cannot access an already ingested document ... and if it can, it's probably a pretty ugly superpower to wield.
Goal: There are other fields provided in the JHU CSSE data which are also pre-aggregated. I would like to produce visualizations that reflect trends such as:
Number of new cases per day
Number of new hospitalizations per day
Number of new deaths per day
Using the data they provide, I can build visualizations that plateau, where the plateau reflects a reduction in new incidences. I'm trying to produce visualizations that show ZERO.
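One way to get that day-over-day difference at query time, without adding a field to the documents, is a date_histogram with a derivative pipeline aggregation. This is only a sketch: the index name covid-us-daily is made up, and it assumes Last_Update is mapped as a date and Recovered as a number:

    import requests

    # Sketch: "covid-us-daily" is a placeholder index name.
    query = {
        "size": 0,
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "Last_Update", "calendar_interval": "1d"},
                "aggs": {
                    "recovered_total": {"max": {"field": "Recovered"}},
                    # Difference between today's and yesterday's cumulative total,
                    # i.e. the RecoveredToday value described above
                    "recovered_today": {"derivative": {"buckets_path": "recovered_total"}}
                }
            }
        }
    }

    resp = requests.post("http://localhost:9200/covid-us-daily/_search", json=query)
    for bucket in resp.json()["aggregations"]["per_day"]["buckets"]:
        print(bucket["key_as_string"], bucket.get("recovered_today", {}).get("value"))

If the index holds more than one state, this would be nested under a terms aggregation on the state field. Kibana's TSVB visualization exposes the same derivative aggregation, so "new cases per day" and similar charts can be built the same way, and they will drop to zero when the cumulative totals stop growing.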
I'm a newbie learning Elasticsearch. Here I'm pushing data from Kafka to ES using Kafka Connect. The index just grows and grows to TBs and it's not easy to perform a search, until I realized I should have a new index hourly/daily, picking the date from the document.
My document looks like:
{
  "base": {
    "message": "",
    "timestamp": "2019-08-09T13:20:11.877Z",
    "type": "vpc"
  },
  "ecs": {
    "version": "1.0.0"
  }
}
Now, could I use the timestamp and type from the document to form a new index name like 'vpc-2019.08.09-1'? That would help me create and route documents to indices based on type and timestamp.
As an example, we have an alias 'foo' which is defined as a time-based alias with the index format foo-yyyy.mm.dd.
We get a document on Jan 10, 2018 to write to the index. The ES client infers that the index to write to is foo-2018.01.10 and writes the data to that index, creating it if required.
We get another document on Jan 11, 2018 to write. The inferred index will be foo-2018.01.11, and the document is written there.
I came across this, but it is based on system time, so it doesn't help.
Any suggestions?
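One approach, sketched below with Python and requests, is Elasticsearch's date_index_name ingest processor, which rewrites the target index based on a date field in the document. The pipeline name, the {{base.type}} prefix, and the idea of applying it via index.default_pipeline are assumptions on my part, and I have not checked how your Kafka Connect sink is configured:

    import requests

    # Sketch: pipeline and index names are placeholders.
    pipeline = {
        "description": "Route documents to <type>-yyyy.MM.dd indices based on base.timestamp",
        "processors": [
            {
                "date_index_name": {
                    "field": "base.timestamp",
                    "index_name_prefix": "{{base.type}}-",
                    "date_rounding": "d",
                    "index_name_format": "yyyy.MM.dd"
                }
            }
        ]
    }
    requests.put("http://localhost:9200/_ingest/pipeline/route-by-timestamp", json=pipeline)

    # One way to apply it without touching the producer: make it the default
    # pipeline of the index the connector currently writes to.
    requests.put(
        "http://localhost:9200/my-kafka-index/_settings",
        json={"index.default_pipeline": "route-by-timestamp"},
    )

With this, a document with base.type "vpc" and base.timestamp "2019-08-09T13:20:11.877Z" ends up in vpc-2019.08.09, which is close to the naming you describe (the trailing "-1" counter would come from rollover/ILM rather than from the pipeline).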
Is there a way to get the date and time that an Elasticsearch document was written?
I am running ES queries via Spark and would prefer NOT to look through all documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
What is the best, most efficient way to do this?
I have looked at:
updating documents to add a field with an array of booleans recording whether each analytic has looked at them yet. The negative is waiting for the update to occur.
an index-per-time-frame approach, which would break the current indexes down into smaller ones, e.g. one per hour. The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6
I posted the question on the Elasticsearch discussion board and it appears that using an ingest pipeline is the best option.
A workaround could be:
While inserting data into Elasticsearch using Logstash, Logstash appends an @timestamp field to the document which represents the time (in UTC) at which the document was created; alternatively, we can use an ingest pipeline.
After that we can query based on that timestamp.
For more on this please have a look at:
Mapping changes
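To illustrate the ingest pipeline variant, here is a minimal sketch (the pipeline name add-ingest-time and the field name ingest_time are placeholders) that stamps each document with the time the ingest node processed it:

    import requests

    # Sketch: names are placeholders. {{_ingest.timestamp}} is the time at which
    # the ingest node processed the document.
    pipeline = {
        "description": "Record the ingest time on every document",
        "processors": [
            {"set": {"field": "ingest_time", "value": "{{_ingest.timestamp}}"}}
        ]
    }
    requests.put("http://localhost:9200/_ingest/pipeline/add-ingest-time", json=pipeline)

    # Index a document through the pipeline via the ?pipeline= query parameter
    # (on 5.x you would use a mapping type in the URL instead of _doc)
    requests.post(
        "http://localhost:9200/my-index/_doc?pipeline=add-ingest-time",
        json={"message": "hello"},
    )

The ingest_time field can then drive the "ingested since the last run" query.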
There is no way to ask ES to insert a timestamp at index time
Elasticsearch doesn't have such functionality.
You need to manually save a date with each document. Then you will be able to search by date range.
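Either way, once each document carries a timestamp, the Spark job only needs a simple range query for documents written since its last run; a sketch, with my-index and ingest_time as placeholder names:

    import requests

    # Sketch: "my-index" and "ingest_time" are placeholders; last_run would be
    # persisted by the job between runs.
    last_run = "2019-08-09T00:00:00Z"

    query = {
        "query": {
            "range": {
                "ingest_time": {"gte": last_run, "lt": "now"}
            }
        }
    }

    resp = requests.post("http://localhost:9200/my-index/_search", json=query)
    print(resp.json()["hits"]["total"])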