Retrieve a large number of documents from Elasticsearch

I have an Elasticsearch database of roughly 20 million documents (each comprising some metadata and some text data). I'm interested in retrieving all the documents containing either of two keywords in the main text (say, "apple" or "banana"). I tried doing a quick search in Kibana and found 5 million hits. How can I export them all so that I can work with the dataset in Python? Is there any way to do that in Kibana?
I have tried using the CSV export functionality in Kibana, but it only exports 500 docs. The standard Elasticsearch search API also limits results to 10,000 documents. What's the best way to retrieve all 5 million docs?
My end goal is to perform NLP on the retrieved data.

If you want to extract a bunch of data from Elasticsearch, I recommend using elasticdump.
https://github.com/elasticsearch-dump/elasticsearch-dump
Here is an example:
elasticdump \
  --input=http://localhost:9200/your_index_name \
  --type=data \
  --searchBody='{"query":{"bool":{"should":[{"match":{"main_text":"apple"}},{"match":{"main_text":"banana"}}]}}}' \
  --output=/path/to/output.json   # elasticdump writes newline-delimited JSON by default

For large-scale clusters, I would recommend using Logstash and its CSV output plugin. Logstash is better suited to handling high volumes of data and gives you more control over data transformation, should you need it.
I would choose elasticdump only for small to medium-sized Elasticsearch indices, where the tool really shines.
If you are interested, you can find here a demo docker-compose setup + configs for the elasticsearch -> logstash -> csv scenario.
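
If the end goal is to work with the matching documents directly in Python (as in the original question), a minimal sketch using the official elasticsearch Python client and its scroll helper could look like the following; the index name, field name, and output path are assumptions carried over from the example above:

import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Same bool/should query as the elasticdump example above (field name is an assumption).
query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"main_text": "apple"}},
                {"match": {"main_text": "banana"}},
            ]
        }
    }
}

# helpers.scan wraps the scroll API, so it is not capped at 10,000 hits.
with open("output.ndjson", "w") as out:
    for hit in helpers.scan(es, index="your_index_name", query=query):
        out.write(json.dumps(hit["_source"]) + "\n")

The resulting newline-delimited JSON file can then be loaded line by line in Python for NLP processing.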

Related

How to export an Elasticsearch index?

We are attempting to export a list of two fields (ID, text) from an existing Elasticsearch index into a text file.
The index contains approximately 2 million rows of data. We tried using the reporting feature in Kibana, but ran into the 10K row limit.
We searched online for other options but haven't found anything viable, since we won't be able to install any third-party tools.
You may use any REST client with the point-in-time (PIT) API to paginate through all the documents.
https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html
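A rough sketch of the PIT + search_after flow with the Python client (the index name, page size, and query below are placeholders, not taken from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a point in time so results stay consistent while paging through them.
pit_id = es.open_point_in_time(index="my_index", keep_alive="5m")["id"]

search_after = None
while True:
    kwargs = dict(
        size=10000,
        query={"match_all": {}},
        pit={"id": pit_id, "keep_alive": "5m"},
        sort=[{"_shard_doc": "asc"}],  # tie-breaker sort required for search_after
    )
    if search_after is not None:
        kwargs["search_after"] = search_after
    hits = es.search(**kwargs)["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"])  # e.g. also grab the text field you need from hit["_source"]
    search_after = hits[-1]["sort"]

es.close_point_in_time(id=pit_id)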
Another option is using the ELK stack's Logstash with the Elasticsearch input plugin and the file output plugin.
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-file.html

Is there a way to import data (csv data) to the winlogbeat kibana dashboard?

I have just started learning about Elasticsearch and Kibana. I created a Winlogbeat dashboard and the logs are coming in fine. I want to import additional data (CSV data) which I created using Python. I tried uploading the CSV file, but I am only allowed to create a separate index and not merge it with the Winlogbeat data. Does anyone know how to do this?
Thanks in advance.
In many use cases, you don't need to actually combine the data into a single index. Here are a few ways you can show combined data, in approximate order of complexity:
Straightforward methods, using separate indices:
Use multiple charts on a dashboard
Use multiple indices in a single chart
More complex methods that combine data into a single index:
Pivot indices using Data Transforms
Combine at ingest-time
Roll your own
Use multiple charts on a dashboard
This is the simplest way: ingest your data into separate indices, make separate visualizations for them, then add those visualizations to one dashboard. If the events are time-based, this simple approach could be all you need.
Use multiple indices in a single chart
Lens, TSVB and Timelion can all use multiple data sources. (Vega can too, but that's playing on hard mode)
Here's an official Elastic video about how to do it in Lens: youtube
Create pivot indices using Data Transforms
You can use Elasticsearch's Data Transforms functionality to fetch, combine and aggregate your disparate data sources into a combined data structure which is then available for querying with Kibana. The official tutorial on Transforming the eCommerce sample data is a good place to learn more.
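As an illustrative sketch with the Python client (the transform id, index names, and fields below are made up; a real transform would group on a key your data sources actually share):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Define a pivot transform that groups documents by a shared key and aggregates them
# into a new, combined index. All names/fields here are illustrative.
es.transform.put_transform(
    transform_id="combine-demo",
    source={"index": ["winlogbeat-*", "my-csv-index"]},
    dest={"index": "combined-pivot"},
    pivot={
        "group_by": {"host": {"terms": {"field": "host.name"}}},
        "aggregations": {"event_count": {"value_count": {"field": "@timestamp"}}},
    },
)
es.transform.start_transform(transform_id="combine-demo")

The resulting combined-pivot index can then be queried and visualized in Kibana like any other index.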
Combine at ingest-time
If you have (or can add) Logstash in the mix, you have several options for combining datasets during the filter phase of your pipelines:
Using a file-based lookup table and the translate filter plugin
By waiting for related documents to come in then outputting a combined document to Elasticsearch with the aggregate filter plugin
Using external lookups with filter plugins like elasticsearch or http
Executing arbitrary ruby code using the ruby filter plugin
Roll your own
If you're generating the CSV file with a Python program, you might want to think about incorporating the Python Elasticsearch DSL library to run queries on the winlogbeat data, then ingest it in its combined state (whether via a CSV or other means).
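A rough sketch of that idea, assuming the two datasets share a host key; the index names, fields, and file path below are illustrative, not taken from the question:

import csv
from elasticsearch import Elasticsearch, helpers
from elasticsearch_dsl import Search

es = Elasticsearch("http://localhost:9200")

# 1) Query the winlogbeat data with elasticsearch-dsl (field/value are illustrative).
s = Search(using=es, index="winlogbeat-*").query("match", **{"event.code": "4625"})
events = [hit.to_dict() for hit in s.scan()]

# 2) Load the CSV produced by your Python program into a lookup table keyed by host.
with open("my_data.csv") as f:
    lookup = {row["host"]: row for row in csv.DictReader(f)}

# 3) Merge on the shared key and bulk-index the combined documents into a new index.
def combined_docs():
    for event in events:
        host = event.get("host", {}).get("name")
        extra = lookup.get(host, {})
        yield {"_index": "winlogbeat-combined", "_source": {**event, **extra}}

helpers.bulk(es, combined_docs())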
Basically, Winlogbeat is a data shipper for Elasticsearch that ships Windows-specific data to an index named winlogbeat with a specific schema and document structure.
You can't merge another document with a different schema into the winlogbeat index.
If your goal is to correlate different data points, use the Time Series Visual Builder (TSVB) to overlay the two datasets in one visualization.

Is there any way to extract Google Analytics data and post it to Elasticsearch?

I have been working on ways to import Google Analytics raw data without having to use a premium account. So far this is the closest link to what I want to do:
How to extract data from Google Analytics and build a data warehouse (webhouse) from it?
I want to load that data into Elasticsearch and display it using Kibana. What is the best ETL approach for this? Has anyone tried to display GA data using the ELK stack?
You should do it in two steps.
First, get the data. A very useful site is https://developers.google.com/webmaster-tools/v3/how-tos/search_analytics, but you first need a Google Webmaster Tools account and OAuth credentials created on https://console.developers.google.com/apis
Then, once you have your data, find a way to import it into Elasticsearch. I'm still looking for the best way to do so; maybe transform the result table into CSV and then use the Logstash csv filter: https://www.elastic.co/guide/en/logstash/current/plugins-filters-csv.html
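If you do end up with a CSV, one simple way to push it into Elasticsearch from Python is the bulk helper; this is only a sketch, and the index name, file name, and columns are made up:

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Stream each CSV row as a document into an "ga-data" index (names are made up).
def rows(path):
    with open(path) as f:
        for row in csv.DictReader(f):
            yield {"_index": "ga-data", "_source": row}

helpers.bulk(es, rows("ga_export.csv"))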
Have a look at this:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-http_poller.html
You can use this to poll an endpoint, in this case GA, and load the response data into Elasticsearch. You may want to filter the response with the split and/or mutate filter plugins as well.
I have done this same setup.
Extracted data from Google Analytics with 7 dimensions and 6 metrics, of which 2 dimensions formed the primary key (timestamp and ID). This was done using R.
Did some transformations on the data using Linux awk and sed commands.
Loaded the data into Apache Hive with row/column formatting, creating 9 tables in total.
Joined all 9 tables in Hive using Hive join queries on the 2 primary-key columns.
Used the elasticsearch-hadoop connector to load the final resulting table into Elasticsearch (see the sketch after this list). Had to do some data transformations to match Hive and Elasticsearch data types.
Used Kibana to visualize the data in Elasticsearch.
Now I am planning to get rid of the manual steps and somehow automate all of the above.
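
For the elasticsearch-hadoop loading step mentioned above, a minimal PySpark sketch could look roughly like this; the Hive table, index name, and connection settings are illustrative, and the elasticsearch-spark connector jar must be on the Spark classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("ga-to-es")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the joined Hive table (name is illustrative).
df = spark.sql("SELECT * FROM ga_combined")

# Write it to an Elasticsearch index via the es-hadoop Spark SQL data source.
(
    df.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .option("es.resource", "ga-data")  # target index
    .mode("append")
    .save()
)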

Comparison of Handling Logs and PDFs in Solr & Elasticsearch and Data Visualization in Banana & Kibana

How do Elasticsearch and Solr compare in respect to the following:
Indexing logs.
Indexing events.
Indexing PDF documents.
Ease of creating and distributing visualizations. Kibana vs Banana.
Support and documentation for developers.
Any help is appreciated.
EDIT
More specifically, I am trying to figure out how exactly a PDF document or an event can be indexed at all. I have worked a little bit with Elasticsearch, and since I am a fan of JSON, I found it quite useful when I tried to index structured data.
For example, logs are mostly structured and thus, I guess, easier to index and search. Now what if I want to index the whole log file itself?
Follow up
Is Kibana the only visualization tool available for Elasticsearch?
Is Banana the only visualization tool available for Solr?
Here is an answer that addresses just the Elasticsearch aspect of the post.
Take a look at https://github.com/elastic/elasticsearch-mapper-attachments for handling PDFs
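Note that the mapper-attachments plugin linked above was later replaced by the ingest attachment processor in newer Elasticsearch versions. A minimal sketch of that newer approach from Python (the pipeline, index, and file names are made up, and the ingest-attachment plugin must be installed on the cluster):

import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an ingest pipeline that extracts text/metadata from a base64 "data" field.
es.ingest.put_pipeline(
    id="pdf-attachment",
    processors=[{"attachment": {"field": "data"}}],
)

with open("report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Index the PDF through the pipeline; the extracted text lands in the "attachment" field.
es.index(index="pdfs", pipeline="pdf-attachment", document={"data": encoded})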
For events/logs, you would need to transform them into structured data to index in Elasticsearch. You can include a field for the source (the log file the data came from, and other information like that), so all the data in the whole log file gets indexed in that fashion. You can then take advantage of ES aggregations to group results by log file, calculate statistics, etc.
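As a sketch, a terms aggregation that groups events by their source log file might look like this from Python; the index name and field names ("logs", "log_file", "level") are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="logs",
    size=0,  # only the aggregation buckets are needed, not the hits
    aggs={
        "per_file": {
            "terms": {"field": "log_file.keyword"},
            "aggs": {"errors": {"filter": {"match": {"level": "ERROR"}}}},
        }
    },
)

# One bucket per log file, with total doc count and error count per file.
for bucket in resp["aggregations"]["per_file"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["errors"]["doc_count"])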
The ELK stack is definitely worth a look.
I don't know if Kibana is the only visualization tool, but it is probably the most popular and likely to offer more than the alternatives.

Indexing logs with es-hadoop

I am new to Elasticsearch and want to index my website logs, which are stored on HDFS, for fast querying.
I have a well-structured pipeline which runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate Elasticsearch with it, so that it also indexes these logs based on particular field(s), thereby giving faster query results via Spark SQL.
So, my question is, can I index my data based on particular field(s) only?
Also, my logs are saved in Avro file format. Does ES provide a way to directly index Avro-serialized data, or do I need to convert it into some other format?
Thank you in advance.
I would suggest looking at the Elasticsearch, Logstash and Kibana (ELK) stack, which should be good enough to fulfill your requirement. Putting the data on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index on whatever fields you wish to query, and build easy dashboards in less than 10 minutes. Take a look at this tutorial for a step-by-step guide:
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html
