Summarization in Elasticsearch

I am a newbie to Elasticsearch. We are currently using Splunk platform for our analytics application and looking to migrate to ELK. Splunk provides options to schedule searches to run in background periodically and to store the search results in a separate summary index. Is similar functionality available in Elasticsearch? If so, please point me to the documentation containing the process.
Thanks,
Keerthana

This is a great use case. Elasticsearch can certainly perform such tasks, but it is more manual: you have to write your own script. For example, if you want to summarize data, you can run Elasticsearch aggregations, take the result (which comes back as JSON), and store it in a separate index where you keep summary data. That way, even if you delete your raw data, your summary data lives on.
Elasticsearch comes with different clients. I like to use the Python Elasticsearch DSL library.
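A minimal sketch of that pattern with the official Python client follows. The index names ("logs-raw", "logs-summary"), the "status" field, and the daily window are illustrative assumptions, and a recent (8.x) elasticsearch-py client is assumed:

```python
from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Aggregate yesterday's raw events (here: document counts per status code).
resp = es.search(
    index="logs-raw",
    size=0,
    query={"range": {"@timestamp": {"gte": "now-1d/d", "lt": "now/d"}}},
    aggs={"by_status": {"terms": {"field": "status", "size": 50}}},
)

# 2. Flatten the aggregation buckets into one compact summary document.
summary = {
    "day": (date.today() - timedelta(days=1)).isoformat(),
    "status_counts": {
        bucket["key"]: bucket["doc_count"]
        for bucket in resp["aggregations"]["by_status"]["buckets"]
    },
}

# 3. Store it in a separate summary index; the raw data can later be deleted.
es.index(index="logs-summary", document=summary)
```

Running a script like this from cron (or any other scheduler) gives you the periodic, background behaviour of Splunk's scheduled summary searches.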

Related

Which tools can I use to query the data stored in Elasticsearch and generate an alert on top of those queries?

I want to query TBs of data stored in Elasticsearch and generate around 500 alerts on top of those queries. I need a plugin or some free, open-source tool to achieve this.
Can Prometheus and ElastAlert help me achieve this? If not, which tool can?
ElastAlert is the tool I was looking for.
Following is the description and a reference link.
ElastAlert 2 is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch and OpenSearch.
If you have data being written into Elasticsearch in near real time and want to be alerted when that data matches certain patterns, ElastAlert 2 is the tool for you.
https://elastalert2.readthedocs.io/en/latest/elastalert.html

Can Beats update existing documents in Elasticsearch?

Consider the following use case:
I want the information from one particular log line to be indexed into Elasticsearch, as a document X.
I want the information from some log line further down the log file to be indexed into the same document X (not overriding the original, just adding more data).
The first part I can obviously achieve with Filebeat.
For the second, does anyone have any idea how to approach it? Could I still use Filebeat plus some pipeline on an ingest node, for example?
Clearly, I can use the ES API to update the said document, but I was looking for a solution that doesn't require changes to my application - rather, something achievable entirely from the log files.
Thanks in advance!
No, this is not something that Beats were intended to accomplish. Enrichment like you describe is one of the things that Logstash can help with.
Logstash has an Elasticsearch input that would allow you to retrieve data from ES and use it in the pipeline for enrichment. And the Elasticsearch output supports upsert operations (update if exists, insert new if not). Using both those features you can enrich and update documents as new data comes in.
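To make the upsert idea concrete, here is a small sketch using the Python client instead of Logstash (the index name, document id, and fields are made up, and a recent elasticsearch-py client is assumed); Logstash's elasticsearch output exposes the same operation through its update/upsert settings:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# First log line: create document X, keyed by a correlation id parsed from the line.
es.update(
    index="app-logs",
    id="request-42",
    doc={"started_at": "2024-01-01T10:00:00Z"},
    doc_as_upsert=True,  # insert the document if it does not exist yet
)

# A later log line: enrich the same document without overwriting the existing fields.
es.update(
    index="app-logs",
    id="request-42",
    doc={"finished_at": "2024-01-01T10:00:03Z", "status": 200},
    doc_as_upsert=True,
)
```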
You might also want to consider ingesting the log lines as-is into Elasticsearch, and then using Logstash to build a separate, entity-specific index driven by the data in those logs.

Comparison of Handling Logs and PDFs in Solr & Elasticsearch and Data Visualization in Banana & Kibana

How do Elasticsearch and Solr compare in respect to the following:
Indexing logs.
Indexing events.
Indexing PDF documents.
Ease of creating and distributing visualizations. Kibana vs Banana.
Support and documentation for developers.
Any help is appreciated.
EDIT
More specifically, I am trying to figure out how exactly a PDF document or an event can be indexed at all. I have worked a little with Elasticsearch, and since I am a fan of JSON, I found it quite useful when indexing structured data.
For example, logs are mostly structured and thus, I guess, easier to index and search. But what if I want to index the whole log file itself?
Follow up
Is Kibana the only visualization tool available for Elasticsearch?
Is Banana the only visualization tool available for Solr?
Here is an answer that tries to address just the Elasticsearch aspect of the post.
Take a look at https://github.com/elastic/elasticsearch-mapper-attachments for handling PDFs
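Note that the mapper-attachments plugin linked above was later superseded by the built-in ingest attachment processor, but the flow is the same either way: base64-encode the binary and let Elasticsearch extract the text (via Apache Tika). A rough sketch with the Python client, using made-up index and pipeline names:

```python
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One-time setup: an ingest pipeline whose attachment processor extracts
# text and metadata from the base64-encoded "data" field.
es.ingest.put_pipeline(
    id="pdf-pipeline",
    processors=[{"attachment": {"field": "data"}}],
)

with open("report.pdf", "rb") as fh:
    encoded = base64.b64encode(fh.read()).decode("ascii")

# The extracted text lands in "attachment.content" and is searchable
# like any other text field.
es.index(
    index="documents",
    pipeline="pdf-pipeline",
    document={"data": encoded, "filename": "report.pdf"},
)
```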
For events/logs, you would need to transform them into structured data before indexing them in Elasticsearch. You can include a field for the source (the log file the data came from, and other information like that); that way, all the data from the whole log file is indexed. You can then take advantage of ES aggregations to group results by log file, calculate statistics, and so on.
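For the log/event side, the transformation boils down to parsing each line into a JSON document that carries a source field; a tiny sketch (regex, paths, and field names purely illustrative):

```python
import re
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

LOG_PATTERN = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<message>.*)")

line = "2024-01-01T10:00:00Z ERROR payment failed for order 1234"
match = LOG_PATTERN.match(line)

doc = {
    "@timestamp": match.group("ts"),
    "level": match.group("level"),
    "message": match.group("message"),
    "source": "/var/log/app/app.log",  # which log file the line came from
}
es.index(index="app-logs", document=doc)

# Aggregations can then group results per source file:
resp = es.search(
    index="app-logs",
    size=0,
    aggs={"per_file": {"terms": {"field": "source.keyword"}}},
)
print(resp["aggregations"]["per_file"]["buckets"])
```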
The ELK stack is definitely worth a look.
I don't know whether Kibana is the only visualization tool, but it is probably the most popular and likely to offer more than the alternatives.

Indexing logs with es-hadoop

I am new to elasticsearch and want to index my website logs which are stored on HDFS for fast querying.
I have a well structured pipeline which runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate Elasticsearch with it, so that it also indexes these logs based on particular field(s), thereby giving faster query results via Spark SQL.
So, my question is, can I index my data based on particular field(s) only?
Also, my logs are saved in the Avro file format. Does ES provide a way to directly index Avro-serialized data, or do I need to convert it into some other format?
Thank you in advance.
I would suggest looking at the Elasticsearch, Logstash and Kibana (ELK) stack, which should be good enough to fulfill your requirement. Putting the data on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index on whatever fields you wish to query, and build simple dashboards in under 10 minutes. Take a look at this tutorial for a step-by-step guide:
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html
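If you do want to bypass Logstash, the same selective-field indexing can be sketched directly in Python. Elasticsearch's APIs take JSON documents, not Avro, so the records are deserialized client-side first (the file name, field list, and index name below are assumptions; fastavro is used to read the Avro records):

```python
from elasticsearch import Elasticsearch, helpers
from fastavro import reader

es = Elasticsearch("http://localhost:9200")
WANTED_FIELDS = ("timestamp", "url", "status", "response_time_ms")

def actions(path):
    """Yield bulk actions containing only the fields we want to query on."""
    with open(path, "rb") as fh:
        for record in reader(fh):  # fastavro yields each Avro record as a dict
            yield {
                "_index": "weblogs",
                "_source": {k: record[k] for k in WANTED_FIELDS if k in record},
            }

helpers.bulk(es, actions("weblogs-2024-01-01.avro"))
```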

elasticsearch-hadoop for Spark: send documents from an RDD to different indexes (by day)

I work on a complex workflow using Spark (parsing, cleaning, Machine Learning ...).
At the end of the workflow I want to send aggregated results to Elasticsearch so that my portal can query the data.
There will be two types of processing: streaming, plus the ability to relaunch the workflow on all available data.
Right now I use elasticsearch-hadoop, and in particular its Spark support, to send documents to Elasticsearch with the saveJsonToEs("myindex/mytype") method.
The goal is to have one index per day, using the proper template that we built.
AFAIK you cannot take a feature (field) of a document into consideration in order to send it to the proper index in elasticsearch-hadoop.
What is the proper way to implement this feature?
Should I add a special step using Spark and the bulk API, so that each executor sends documents to the proper index based on the feature of each line?
Is there something that I missed in elasticsearch-hadoop?
I tried to send JSON to _bulk using saveJsonToEs("_bulk"), but the resource pattern has to follow the index/type format.
Thanks to Costin Leau, I have the solution.
Simply use dynamic indexing with something like saveJsonToEs("my-index-{date}/my-type"); "date" has to be a feature of the document being sent.
Discussion on elasticsearch google group: https://groups.google.com/forum/#!topic/elasticsearch/5-LwjQxVlhk
Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html#spark-write-dyn
You can use : ("custom-index-{date}/customtype") to create dynamic index. This could be any field in given rdd.
If you want format the date : ("custom-index-{date:{YYYY.mm.dd}}/customtype")
[Answered to question ask by Amit_Hora in the comment, as I don't have enough privilege to comment, I am adding this here]
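The same dynamic-index pattern can also be used from PySpark through the es-hadoop connector's DataFrame support. A rough sketch (the host, index pattern, and field names are assumptions, and the elasticsearch-hadoop jar must be on the Spark classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-dynamic-index").getOrCreate()

docs = spark.createDataFrame(
    [("2015-06-01", "user login"), ("2015-06-02", "user logout")],
    ["date", "message"],
)

# Each row is routed to the index named after its own "date" field,
# e.g. custom-index-2015-06-01.
(docs.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .mode("append")
    .save("custom-index-{date}/customtype"))
```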
