What kind of persistent data does Kibana store in path.data?

There is a Kibana configuration setting called path.data whose description says:
The path where Kibana stores persistent data not saved in Elasticsearch.
What kind of persistent data is stored there?
As far as I am aware, Kibana stores most of its information in Elasticsearch under its index (.kibana by default).

There isn't much documentation on this; however, I did find an Elasticsearch team member's response to another question, stating:
You're correct that all Kibana saved objects are stored in Elasticsearch, in the .kibana index. It doesn't write anything to the filesystem (save for maybe some temporary files, but even that I'm pretty sure doesn't happen).
Therefore, I would say that only temporary information is stored in path.data, and all the relevant information (whether for persistence or for monitoring) is stored under the .kibana index.
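For what it's worth, one way to check is to peek at what actually lives in the .kibana index. A minimal sketch with the Python client (the cluster URL is an assumption, and on recent versions .kibana is a hidden system index, so direct access may log a deprecation warning or require elevated privileges):

    from elasticsearch import Elasticsearch

    # Assumes a local cluster; adjust the URL and auth for your deployment.
    es = Elasticsearch("http://localhost:9200")

    # Peek at a few of Kibana's saved objects in its backing index.
    resp = es.search(index=".kibana", size=5)
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit["_source"].get("type"))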
Can someone else confirm this?

Related

What does Elasticsearch store, and how?

Elasticsearch is a search engine, according to Wikipedia. This implies it is not a database and does not store the data it is indexing (but presumably does store its indexes).
There are presumably two ways to get data into Elasticsearch: log shipping, or directly via the API.
Let's say my app wants to write an old-fashioned log file entry:
Logger.error(now() + " something bad happened in module " + module + "; " + message)
This could either write to a file or put the data directly in Elasticsearch using a REST API.
If it was done via the REST API, does Elasticsearch store the entire log message, in which case you don't need to waste disk space writing the logs to files for compliance etc.? Or does it only index the data, so you need to keep a separate copy? If you delete or move the original log file, how does Elasticsearch know, and is what it does store still useful?
If you write to a log file, then use Logstash or similar to "put the log data in Elasticsearch", does Elasticsearch store the entire log file as well as any indexes?
How does Elasticsearch parse or index arbitrary log files? Does it treat a log line as a single string, or does it require logs to have a specific format such as CSV or JSON?
Does anyone know of a resource with this key info?
Elasticsearch does store the data you are indexing.
When you ingest data into Elasticsearch, it is stored in one or more indices and can then be searched. To be able to search something with Elasticsearch, the data needs to be stored in Elasticsearch; it cannot, for example, search external files.
In your example, if you have an app sending logs to Elasticsearch, it will store the entire message you send, and once it is in Elasticsearch you don't need the original log anymore.
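For example, a minimal sketch with the Python client (cluster URL, index name, and log fields are assumptions for illustration): the whole JSON body you send is kept in the document's _source and comes back verbatim.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Index a log entry directly via the API; Elasticsearch keeps the
    # entire JSON body in the document's _source field.
    doc = {
        "@timestamp": "2023-05-01T12:00:00Z",
        "level": "ERROR",
        "module": "billing",
        "message": "something bad happened",
    }
    resp = es.index(index="app-logs", document=doc)

    # The original JSON comes back as-is; no separate log file is needed.
    stored = es.get(index="app-logs", id=resp["_id"])
    print(stored["_source"])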
If you need to parse your documents into different fields, you can do it before sending the log to Elasticsearch as a JSON document, use Logstash to do this, or use an ingest pipeline in Elasticsearch.
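If the raw line arrives as a single string, an ingest pipeline can split it into fields at index time. A sketch under the same assumptions (the pipeline name and grok pattern are invented for illustration):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Define an ingest pipeline that parses a raw log line into fields.
    es.ingest.put_pipeline(
        id="parse-app-logs",  # hypothetical pipeline name
        processors=[
            {
                "grok": {
                    "field": "message",
                    "patterns": [
                        "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}"
                    ],
                }
            }
        ],
    )

    # Index a raw line through the pipeline; it is stored already parsed.
    es.index(
        index="app-logs",
        pipeline="parse-app-logs",
        document={"message": "2023-05-01T12:00:00Z ERROR disk is full"},
    )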
A good starting point to learn more about how it works is the official documentation.

Where is the original JSON document stored in Lucene through Elasticsearch on disk (Linux/Ubuntu)?

After reading blogs, I came to understand that data is stored in Lucene and only metadata is stored in Elasticsearch.
When we index a doc through Elasticsearch, we store the inverted index in a segment, but somewhere we also need to store the JSON doc that will be retrieved. I'm unable to figure out the location on disk, even after going through several blogs.
Note: /var/lib/elasticsearch/data (mentioned in the official docs and in a Stack Overflow question) does not exist on my Ubuntu machine.
Thanks in advance
You're not supposed to store or retrieve documents directly on disk; you have to go through one of the Elasticsearch APIs to store or retrieve your document.
I think the documents are converted into a binary format, and I think the decision to hide the files behind the API is meant to protect the integrity of the content, especially when working with replicated content (so that all nodes have the same content) and to be sure that the inverted index always reflects the real content on disk.
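What is documented is that the original JSON is kept as a compressed stored field called _source inside the Lucene segment files under the data path, and the APIs decode it for you. A minimal retrieval sketch (cluster URL, index name, and id are assumptions):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Fetch a document by id through the API. Internally the original JSON
    # lives in the compressed _source stored field of a Lucene segment;
    # the API decodes it, so you never read the segment files yourself.
    resp = es.get(index="my-index", id="1")
    print(resp["_source"])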

Where / how does Elasticsearch store logs received from Logstash?

Disclaimer: I am very new to the ELK Stack, so these questions may be very basic.
I am setting up the ELK stack now and have the basic questions below about Elasticsearch.
1. What storage model does Elasticsearch follow? For example, Oracle uses the relational model, Alfresco uses a "document model", and Apache Jackrabbit uses a "hierarchical model".
2. Is log data stored in Elasticsearch persistent/permanent, or does Elasticsearch delete log data after a certain period?
3. How will we manage/back up this data?
4. Are the log/data files in Elasticsearch human-readable?
Any help/route to documentation will be appreciated.
1. The storage model is a document model. Everything is a document. Documents are of a particular type and are stored in an index.
2. Data sent to Elasticsearch is stored on disk. It can then be read, searched, or deleted through a REST API.
3. The data is managed through the REST API. For log centralisation, logs are usually stored in date-based indices (one index for today, one for yesterday, and so on), so to delete the logs from one day you delete the relevant index; Curator can help with this (see the sketch after this list). Elasticsearch also offers a backup and restore (snapshot) module.
4. To access the data in Elasticsearch, you'll have to use the REST API or the Kibana client.
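As a sketch of point 3 (index names and the retention window are assumptions, and this hand-rolls what Curator or ILM would automate for you):

    from datetime import date, timedelta

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Logs are commonly written to one index per day, e.g. logs-2023.05.01.
    today = date.today()
    es.index(
        index=f"logs-{today:%Y.%m.%d}",
        document={"level": "INFO", "message": "service started"},
    )

    # Retention then becomes "delete whole indices older than N days".
    cutoff = today - timedelta(days=30)
    es.indices.delete(index=f"logs-{cutoff:%Y.%m.%d}", ignore_unavailable=True)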
Documentation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index, so we could persist them while the index was down for an extended period of time (for example, a complete re-index) and then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages, but I haven't done any sort of estimation of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task, as most of the work would be done automatically; however, there are four questions I have remaining:
If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase, is this possible to do (perhaps it is even the default behaviour)?
Is it possible to apply some sort of versioning so that a document in the index is not overwritten by an older version coming from Couchbase?
If I were to add a new field to the index, I might need to re-index from the actual document datasource and then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and, say, 500,000 documents in Couchbase that I want to re-apply to Elasticsearch. What would the speed be like?
Would I be able to apply any sort of logic in-between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server and a message listener, and would make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index when it actually contains a more recent document (not sure if this would ever happen, now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that is in Couchbase.
XDCR is used to send "all the events" to Elasticsearch. This means the index is updated every time a document (stored in Couchbase) is created, modified, or deleted.
So "all the documents" stored in a Couchbase bucket are indexed in Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch integration.
1. When a document is removed from Couchbase, the Elasticsearch index is updated (the entry is removed).
2. I am not sure I understand the question. How could an "older" version come from Couchbase? In any case, every time a document stored in Couchbase is modified, the index in Elasticsearch is updated. (See the versioning sketch after this list.)
3. I am not sure I understand where you want to add the new field. If it is in a document stored in Couchbase, the index will be updated when the document is sent to Elasticsearch. But as I said before: all documents "stored" in Couchbase will be present in the Elasticsearch index.
4. Not with the plugin as it is today, but as you know it is an open-source project, so you can either add some logic to it or contribute your ideas to the project (https://github.com/couchbaselabs/elasticsearch-transport-couchbase).
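On question 2, a generic Elasticsearch building block worth knowing, independent of what the Couchbase plugin itself does, is external versioning: a write carrying a lower version than the one already indexed is rejected instead of overwriting it. A minimal sketch (cluster URL, index, ids, and version numbers are assumptions):

    from elasticsearch import ConflictError, Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Index a document with an externally supplied version number.
    es.index(
        index="docs", id="42",
        document={"title": "v5 of the document"},
        version=5, version_type="external",
    )

    # Replaying an older message is rejected instead of overwriting.
    try:
        es.index(
            index="docs", id="42",
            document={"title": "stale v3"},
            version=3, version_type="external",
        )
    except ConflictError:
        print("older version ignored")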
So let me ask you more questions:
- How do you insert the document in your application? (And where: Couchbase? Elasticsearch?)
- What are the types of documents?
- What do you want to cache in Couchbase?

Is Lucene calling the data source for every request?

I am new to Apache Lucene. Could someone explain how Apache Lucene works?
For every request, will it go back to the data source (documents, database, etc.)?
Or will it look at the Lucene index alone?
Once documents are indexed, Lucene will only look at the index and nowhere else.
You also need to understand the difference between indexing and storing data in the index. The former allows a document to be found, while the latter allows the data to be read once a relevant document is found.
Why is this distinction useful? Sometimes you can index all fields but store only the ID, and retrieve the actual data from an external source (e.g. a database) using that ID. Or you can store the data in the index and load it from there instead of going to another data source.
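In Elasticsearch terms (it sits on top of Lucene and exposes the same indexed-versus-stored distinction), here is a sketch of the "index everything, keep only the ID retrievable" pattern; the index name, fields, and external lookup are assumptions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local cluster

    # Index the body so it is searchable, but exclude it from _source so
    # only the lightweight ID field stays retrievable from the index.
    es.indices.create(
        index="articles",
        mappings={
            "_source": {"excludes": ["body"]},
            "properties": {
                "body": {"type": "text"},      # indexed (searchable), not kept
                "db_id": {"type": "keyword"},  # kept, used to look up the row
            },
        },
    )
    es.index(index="articles", document={"db_id": "row-17", "body": "full text here"})
    es.indices.refresh(index="articles")  # make the new doc searchable

    # A match on body finds the document, but only db_id comes back;
    # the full text would then be fetched from the external database.
    resp = es.search(index="articles", query={"match": {"body": "full"}})
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])  # {'db_id': 'row-17'}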
