Index existing documents on startup - elasticsearch

I'm new to Elasticsearch and this is a question I've been trying to find an answer to. Basically I have around a thousand documents that I would like Elasticsearch to index for me. Do I have to write a bash/python script that would just use curl to PUT/POST all these documents to my Elasticsearch server, or can I configure the server so that it automatically indexes documents from a specific folder/location on disk when I start it up for the first time?

As far as I know, Elasticsearch does not have any option for pulling in documents to index by itself. As you mentioned, you need to create a script and push your documents to ES yourself.
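If you go the script route, a minimal sketch in Python using the plain REST API via requests could look like this (the index name, the folder path, and the assumption that each file is a JSON document are placeholders for illustration):

import json
import os
import requests

ES_URL = "http://localhost:9200"   # adjust to your cluster
INDEX = "my_documents"             # hypothetical index name

def index_folder(folder):
    """Index every .json file in `folder` as one Elasticsearch document."""
    for name in os.listdir(folder):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(folder, name)) as f:
            doc = json.load(f)
        doc_id = os.path.splitext(name)[0]   # use the file name as the document id
        resp = requests.put(f"{ES_URL}/{INDEX}/_doc/{doc_id}", json=doc)
        resp.raise_for_status()

index_folder("/path/to/documents")

For about a thousand documents a simple loop like this is usually fine; if it gets slow, the _bulk endpoint lets you send many documents per request.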

Related

Copy documents in another index on creation in Elasticsearch

We want to keep track of all the changes to a document, so we want to store all the document versions in a separate index.
Is there a way, when a document is added or changed, to send the entire document to another index as well? Maybe there is a processor for this use case?
As far as I know, Elasticsearch as such supports only version numbers, but there is no way to trace back to a previous version.
You could maintain the version history in a separate Elasticsearch index.
Whenever you update main_index, ensure that you update the versions index as well:
POST main_index/_doc/doc_id
POST main_index_versions/_doc/doc_id_version
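A rough sketch of that dual write in Python (the index names, the id scheme, and the field layout are made up for illustration):

import requests

ES_URL = "http://localhost:9200"

def save_with_history(doc_id, version, doc):
    """Store the current document and keep a copy of every version in a second index."""
    # Current state lives in main_index under the plain document id
    requests.put(f"{ES_URL}/main_index/_doc/{doc_id}", json=doc).raise_for_status()
    # Each version is kept in the separate versions index, keyed by id plus version number
    requests.put(f"{ES_URL}/main_index_versions/_doc/{doc_id}_{version}", json=doc).raise_for_status()

save_with_history("42", 3, {"title": "example", "body": "current content"})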
Maybe you can configure Logstash to do this... not sure.

How is data getting mapped in Elasticsearch in ELK?

I am new to the ELK stack and I am in the process of learning it. In my project, data is imported from Amazon S3 -> Filebeat -> Logstash -> Elasticsearch -> Kibana.
In the Logstash file, they import the data directly and send it to Elasticsearch with something like the output below, and there is no index mentioned in the config file:
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
  }
}
In Amazon S3 we have logs from Salesforce, and in the future we are going to ingest from multiple sources.
In Elasticsearch I can see that 41 indices are present (using a GET curl script). Assuming we keep the same Logstash setup, all logs (from multiple sources) will be sent to Elasticsearch in the same manner. I would like to know how the data gets mapped to a particular index in Elasticsearch.
In many tutorials an index is given in the Logstash config file, so in Kibana you can see the index name along with a timestamp. I tried placing a sample Mulesoft log file in Amazon S3, but I can't find that data in Kibana. So do I need to create one more new index named "mule" along with its mappings?
There is no ELK expert on my project, so please guide me on how to approach this; any references would be helpful.
This page (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html) documents Logstash's Elasticsearch output plugin.
As you can see in the Configuration Options section, the index option is not mandatory. If this option is not specified, its default value is logstash-%{+YYYY.MM.dd}.
That being said, the documents will get indexed into indices with the prefix 'logstash-' followed by the date of ingestion. For example:
logstash-2020.04.07
logstash-2020.04.08
Since someone in your organization chose to go with the default value, the option could be left out. This explains why you can't find a particular index name in the Logstash configuration. If you need to index documents into different indices, you have to set a specific value for the index option.
Elasticsearch will automatically create these indices with a dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html) unless you have set up an explicit mapping via index templates in advance. In order to see the data in Kibana, you first need to create an index pattern matching the index name.
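If you do want explicit mappings instead of the dynamic ones, you can register an index template before the indices are created. A minimal sketch in Python using the legacy _template endpoint (the mule-* pattern, the template name, and the field names are only assumptions; the body as written assumes a 7.x cluster, and newer clusters also offer the _index_template API):

import requests

ES_URL = "http://localhost:9200"

# Any index whose name matches the pattern will pick up these mappings on creation.
template = {
    "index_patterns": ["mule-*"],   # hypothetical pattern for the Mulesoft logs
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "message": {"type": "text"},
            "source": {"type": "keyword"},
        }
    },
}

resp = requests.put(f"{ES_URL}/_template/mule_logs", json=template)
resp.raise_for_status()
print(resp.json())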
I hope I could help you.

Reindex all of ElasticSearch with Curator?

Is there a recipe out there to reindex all Elasticsearch indices with Curator?
I'm seeing that it can reindex a set of indices into one (daily-to-monthly use case), however I don't see anything that suggests it could easily apply a new mapping to every index.
I'm guessing I'll need to write a wrapper script around Curator to grab index names and feed them into Curator.
I don't know if I got you right, as you mentioned both reindexing and mapping changes...
If you want to set or update a mapping on a collection of indices, and you know the indices to update by name (or pattern), you can apply the same mapping or a mapping change to all of them at once with https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html#_multi_index_2
For reindexing, there is no way to specify multiple source/target pairs at once, but you can split one index into many. As you suggested, you can use subsequent calls to the reindex API.
BTW: the reindex API copies neither the settings nor the mappings from the source into the destination index. You need to handle that yourself, maybe using https://www.elastic.co/guide/en/elasticsearch/reference/6.4/indices-templates.html
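If you do end up scripting around it, a loop over the reindex API is straightforward. A rough Python sketch (the -v2 naming scheme for the destination indices is just an assumption; the new mappings are expected to come from an index template matching those names):

import requests

ES_URL = "http://localhost:9200"

# _cat/indices with format=json returns one dict per index.
rows = requests.get(f"{ES_URL}/_cat/indices?format=json").json()
indices = [row["index"] for row in rows if not row["index"].startswith(".")]  # skip system indices

for src in indices:
    dest = f"{src}-v2"   # hypothetical naming scheme for the re-mapped copies
    body = {"source": {"index": src}, "dest": {"index": dest}}
    # For very large indices consider ?wait_for_completion=false and polling the task instead.
    resp = requests.post(f"{ES_URL}/_reindex", json=body)
    resp.raise_for_status()
    print(src, "->", dest, resp.json().get("total"), "documents")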

How to use Elasticsearch to make files in a directory searchable?

I am very new to search engines and Elasticsearch, so please bear with me and apologies if this question sounds vague. I have a large directory with lots of .csv and .hdr files, and I want to be able to search the text within these files. I've done the tutorials and read some of the documentation, but I'm still struggling to understand the concept of indexing. It seems like all the tutorials show you how to index one document at a time, but this will take a long time as I have lots of files. Is there an easier way to make Elasticsearch index all the documents in this directory and be able to search for what I want?
Elasticsearch can only search documents that it has indexed. Indexed means Elasticsearch has consumed the documents one by one and stored them internally.
Normally the internal structure matters, and you should understand what you're doing to get the best performance.
So you need a way to get your files into Elasticsearch; I'm afraid there is no "one click" way to achieve this...
You need
1. a running cluster
2. an index designed for the documents
3. a way to get the documents from the filesystem into Elasticsearch
Your question is focused on 3).
For this, search for script examples or tools that can crawl your directory and feed the documents to Elasticsearch.
5 seconds of using Google brought me to
https://github.com/dadoonet/fscrawler
https://gist.github.com/stevehanson/7462063
Theoretically it could be done with Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), but I would give fscrawler a try.
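If you prefer a small script over fscrawler, a sketch of step 3 in Python could look like this (the index name is a placeholder, and it assumes the .csv files have a header row; the .hdr files would need their own parsing):

import csv
import os
import requests

ES_URL = "http://localhost:9200"
INDEX = "my_files"   # hypothetical index name

def index_csv_files(directory):
    """Walk `directory` and index every row of every .csv file as one document."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.endswith(".csv"):
                continue
            path = os.path.join(root, name)
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    doc = {"file": path, **row}   # keep the source file for reference
                    requests.post(f"{ES_URL}/{INDEX}/_doc", json=doc).raise_for_status()

index_csv_files("/path/to/directory")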

Ways to only process new (indexed after the last run) data in Elasticsearch?

Is there a way to get the date and time that an Elasticsearch document was written?
I am running ES queries via Spark and would prefer NOT to look through all the documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
What is the best, most efficient way to do this?
I have looked at:
updating each document to add a field with an array of booleans indicating whether it has been looked at by each analytic. The negative is waiting for the update to occur.
an index-per-time-frame method, which would break the current indices down into smaller ones, e.g. by hour. The negative I see is the number of open file descriptors.
??
Elasticsearch version 5.6
I posted the question on the Elasticsearch discussion board and it appears that using an ingest pipeline is the best option.
I am running ES queries via Spark and would prefer NOT to look through all the documents that I have already processed. Instead I would like to read only the documents that were ingested between the last time the program ran and now.
A workaround could be:
While inserting data into Elasticsearch using Logstash, Logstash appends a @timestamp field to the document, which represents the time (in UTC) at which the document was created; alternatively we can use an ingest pipeline.
After that we can query based on the timestamp.
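A sketch of the ingest-pipeline variant in Python: create a pipeline with a set processor that stamps each document at index time, then index through it (the pipeline name, index name, and field name are placeholders):

import requests

ES_URL = "http://localhost:9200"

# Pipeline that copies the ingest timestamp into every document that passes through it.
pipeline = {
    "description": "record when the document was ingested",
    "processors": [
        {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}}
    ],
}
requests.put(f"{ES_URL}/_ingest/pipeline/add_ingest_time", json=pipeline).raise_for_status()

# Index a document through the pipeline; it comes back with an ingested_at field.
doc = {"message": "example"}
requests.post(f"{ES_URL}/my_index/_doc?pipeline=add_ingest_time", json=doc).raise_for_status()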
For more on this please have a look at :
Mapping changes
There is no way to ask ES to insert a timestamp at index time
Elasticsearch doesn't have such functionality.
You need to manually save a date with each document. In that case you will be able to search by date range.
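Either way, once every document carries such a date, picking up only the new ones is a plain range query. A sketch, reusing the hypothetical ingested_at field from above (the cut-off would be the timestamp of the last run):

import requests

ES_URL = "http://localhost:9200"

query = {
    "query": {
        "range": {
            "ingested_at": {"gt": "2020-04-07T00:00:00Z"}   # timestamp of the previous run
        }
    }
}
resp = requests.post(f"{ES_URL}/my_index/_search", json=query)
resp.raise_for_status()
print(resp.json()["hits"]["total"])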
