How to export an Elasticsearch index? - elasticsearch

We are attempting to export a list of two fields (ID, text) from an existing Elasticsearch index into a text file.
The index contains approximately 2 million documents. We tried using the reporting feature in Kibana, but ran into the 10K row limit.
We searched online for other options but haven't found anything viable, since I don't think we'll be able to install any 3rd-party tools.

You can use any REST client with the point in time (PIT) API to paginate through the documents.
https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html
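For example, a minimal sketch with curl (the index and field names are placeholders):
# Open a point in time on the index and note the returned id
curl -X POST "localhost:9200/your_index_name/_pit?keep_alive=5m"
# Page through the documents, 10,000 at a time; omit search_after on the
# first request, then pass the sort values of the last hit of each page
curl -X POST "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
  "size": 10000,
  "_source": ["id", "text"],
  "query": { "match_all": {} },
  "pit": { "id": "<pit id from the previous call>", "keep_alive": "5m" },
  "sort": [{ "_shard_doc": "asc" }],
  "search_after": [<sort values of the last hit from the previous page>]
}'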
Another option is to use Logstash from the ELK stack, with the Elasticsearch input plugin and the File output plugin.
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-file.html
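A minimal pipeline for this case could look roughly like the following (index name, field names and output path are assumptions):
input {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "your_index_name"
    query => '{ "query": { "match_all": {} }, "_source": ["id", "text"] }'
  }
}
output {
  file {
    # one line per document, with the two fields of interest
    path => "/path/to/export.txt"
    codec => line { format => "%{id},%{text}" }
  }
}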

Related

Retrieve large number of documents elasticsearch

I have an ElasticSearch database of roughly 20 million documents (each comprising some metadata and some text data). I'm interested in retrieving all the documents containing either of two keywords in the main text (say, "apple" and "banana"). I tried doing a quick search in Kibana and found 5 million hits. How can I export them all so that I can work with the dataset in Python? Is there any way to do that in Kibana?
I have tried using the CSV export functionality in Kibana, but it only exports 500 docs. The standard Elasticsearch search API also limits the number of documents to 10,000. What's the best way to retrieve all 5 million docs?
My end goal is to perform NLP on the retrieved data.
If you want to extract a lot of data from Elasticsearch, I recommend using elasticdump.
https://github.com/elasticsearch-dump/elasticsearch-dump
Here is an example:
elasticdump \
--input=http://localhost:9200/your_index_name \
--type=data \
--searchBody='{"query":{"bool":{"should":[{"match":{"main_text":"apple"}},{"match":{"main_text":"banana"}}]}}}' \
--output=/path/to/output.csv
For large-scale clusters, I would recommend using Logstash and its CSV output plugin. Logstash is better suited to handling high volumes of data and gives you more control over data transformation, should you need it.
I would choose elasticdump only for small to medium-sized Elasticsearch indices, where the tool really shines.
If you are interested, you can find here a demonstrative docker-compose setup + configs for the elasticsearch -> logstash -> csv scenario.
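For reference, a minimal Logstash pipeline for that scenario might look like this (index name, field names and paths are assumptions):
input {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "your_index_name"
    query => '{ "query": { "bool": { "should": [ { "match": { "main_text": "apple" } }, { "match": { "main_text": "banana" } } ] } } }'
  }
}
output {
  csv {
    # only the listed fields end up in the CSV, one document per row
    fields => ["id", "main_text"]
    path => "/path/to/output.csv"
  }
}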

Is it possible to append (instead of restore) a snapshot of indices?

Suppose we have some indices in our cluster. I can make a snapshot of my favorite index, and I can restore that index to my cluster again as long as an index with the same name does not exist or is closed. But what if the index currently exists and I need to add/append extra data/documents to it?
Suppose I currently have 100,000 documents in the index on my server. I create/add 100 documents to an index on my local system that has the same name, the same mappings, the same settings, the same number of shards, and so on. Now I want to add those 100 documents to the existing index on my server (the one with 100,000 documents). What is the best way?
In MySQL I use export to CSV or Excel, etc., and it is easy to import or append data to an existing index.
There is no append API in Elasticsearch, but I suggest restoring the snapshot under a temporary index name and using the Reindex API to index the local data into the bigger index; delete the temporary index afterwards.
You can also use Logstash for this purpose (reindexing): build a pipeline that reads data from the temporary index (Elasticsearch input plugin) and writes it to the primary index (Elasticsearch output plugin).
Note: you can't have two indices with the same name in a cluster.
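A rough sketch of the reindex step with curl (index names are placeholders):
# Copy the documents from the temporarily restored index into the existing one
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "my_index_restored_tmp" },
  "dest": { "index": "my_index" }
}'
# When the reindex has finished, drop the temporary index
curl -X DELETE "localhost:9200/my_index_restored_tmp"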
In addition to the answer by Hamid Bayat:
Is it possible to append (instead of restore) a snapshot of indices?
Snapshots are by nature incremental, i.e. append-only. See this and also this. Thus, if your index has 1000 docs and you snapshot it and later add 100 more docs, then when you trigger another snapshot only the recently added 100 docs will be snapshotted, not all 1100. However, restore is not incremental, i.e. you cannot restore only those recently added 100 docs. If you restore an index, you restore all the docs.
From your description, it seems you are looking for something like this: when you add 100 docs to the local ES cluster, you also want those 100 docs to be added to the remote (other) ES cluster. Am I correct?
As for CSV or Excel export, there's an excellent tool called es2csv that lets you export data from ES to CSV. You can then use Kibana to import the CSV data. Or use the tool called Elasticsearch_Loader. You might also want to look at another excellent tool called elasticdump.
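For example, a typical es2csv invocation looks roughly like this (host, index and query are placeholders; check the tool's help output for the exact flags):
# Export every document of my_index to a CSV file
es2csv -u http://localhost:9200 -i my_index \
       -r -q '{"query": {"match_all": {}}}' \
       -o export.csv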

How does data get mapped in Elasticsearch in ELK?

I am new to the ELK stack and I am in the process of learning it. In my project, data is imported via Amazon S3 -> Filebeat -> Logstash -> Elasticsearch -> Kibana.
In the Logstash config file, the data is sent directly to Elasticsearch with something like the snippet below, and no index is mentioned in the config file:
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
  }
}
In Amazon S3, we have logs from Salesforce, and in the future we are going to ingest from multiple sources.
In Elasticsearch, I can see that 41 indices are present (checked with a GET curl call). Assuming we keep the same setup in Logstash, all logs (from multiple sources) will be sent to Elasticsearch in the same manner. I would like to know how the data gets mapped to a particular index in Elasticsearch.
In many tutorials, an index is specified in the Logstash config file, so in Kibana you can see the index name along with a timestamp. I tried placing a sample Mulesoft log file in Amazon S3, but I can't find that data in Kibana. So do I need to create a new index named Mule along with mappings?
There is no ELK expert on my project, so please guide me on how to approach this; any references would be very helpful.
This page (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html) documents Logstash's Elasticsearch output plugin.
As you can see in the Configuration Options section, the index option is not mandatory. If this option is not specified, its default value is logstash-%{+YYYY.MM.dd}.
With that being said, the documents will get indexed into indices with the prefix 'logstash-' followed by the date of ingestion. For example:
logstash-2020.04.07
logstash-2020.04.08
Since someone in your organization chose to go with the default value, the option could be left out. This explains why you can't find a particular index name in the Logstash configuration. If you need to index documents into different indices, you'd have to set an explicit value for the index option.
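For example, to route the Mulesoft logs into their own daily index you could set the index option explicitly (the index name below is just an illustration):
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mule-%{+YYYY.MM.dd}"
  }
}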
Elasticsearch will automatically create these indices with a dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html) if you haven't set up an explicit mapping via index templates in advance. In order to see the data in Kibana, you first need to create an index pattern matching the index name.
I hope this helps.

Is there any possibility to extract Google Analytics data and post it to Elasticsearch?

I have been working on ways to import raw Google Analytics data without having to use a premium account. So far, this is the closest link to what I want to do:
How to extract data from Google Analytics and build a data warehouse (webhouse) from it?
I want to load that data into Elasticsearch and display it using Kibana. What is the best ETL approach for this? Has anyone tried to display GA data using the ELK stack?
You should do it in two steps.
First, get the data. A very useful resource is https://developers.google.com/webmaster-tools/v3/how-tos/search_analytics but you first need a Google Webmaster Tools account and OAuth credentials created on https://console.developers.google.com/apis
Then, once you have your data, find a way to import it into Elasticsearch. I'm still looking for the best way to do so; maybe transform the result table into CSV and then use the Logstash CSV filter: https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html
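If you go the CSV route, a pipeline along these lines could load the file (file path, column names and index name are assumptions):
input {
  file {
    path => "/path/to/ga_export.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => ","
    columns => ["date", "sessions", "pageviews"]
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "google-analytics"
  }
}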
Have a look at this:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-http_poller.html
You can use this to poll an endpoint, in this case GA, and load the response data into Elasticsearch. You may want to filter the response with the Split and/or Mutate plugins as well.
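A rough sketch of such a pipeline (the URL, schedule, field names and index are placeholders; authenticating against the Analytics API is left out):
input {
  http_poller {
    urls => {
      ga_report => "https://example.com/analytics/report.json"
    }
    schedule => { every => "1h" }
    codec => "json"
  }
}
filter {
  # assuming the response carries the result rows in a "rows" array
  split { field => "rows" }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "google-analytics"
  }
}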
I have done this same setup.
Extracted data from Google Analytics with 7 dimensions and 6 metrics, of which 2 dimensions formed the primary key (timestamp and ID). This was done using R.
Did some transformations on the data using Linux awk and sed commands.
Loaded the data into Apache Hive with row/column formatting, creating 9 tables in total.
Joined all 9 tables in Hive using join queries on the 2 primary keys.
Used the elasticsearch-hadoop connector to load the final resulting table into Elasticsearch. Had to do some small data transformations to match Hive and Elasticsearch data types.
Used Kibana to visualize the data in Elasticsearch.
Now I am planning to avoid all the manual steps and somehow automate all the steps above.

Elasticsearch indexes but does not store documents

I'm having trouble storing documents in a 3-node Elasticsearch cluster that was previously able to store them. I use the Java API to send bulks of documents to Elasticsearch, which are accepted (no failures in the BulkResponse object), AND Elasticsearch shows heavy indexing activity. However, the number of documents does not increase, and I assume none of them are actually stored.
I've looked into the Elasticsearch logs (on all three nodes) but I see no errors or warnings.
Note: I've had to restart two nodes previously, but search/query works perfectly. (The count in the image starts at ~17:00 because that is when I installed the Marvel plugin.)
What can I do to solve or debug the problem?
Sorry for my point-blank code blindness! I forgot to advance the cursor when reading from MongoDB and therefore re-inserted the same 1,000 documents into Elasticsearch thousands of times!
Lesson learned: if this problem occurs, check that you are selecting the correct documents in your database and that those documents are not already stored in ES.
Side note on Marvel: it would be great if this could be surfaced in some way, e.g. with a chart of "updated documents" (I rechecked and could not find one).
