Can Elasticsearch indexed data be put into HDFS? I'm not sure about this, so I thought I'd get an expert view on it.
I'm not certain exactly what you are looking for. If you want to backup/restore data into HDFS from Elasticsearch, the answer is the Hadoop HDFS snapshot/restore plugin for Elasticsearch:
https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs
This allows you to back up and restore data from ES into HDFS.
If, on the other hand, you want to run MapReduce jobs in Hadoop that access Elasticsearch data, the answer is Elasticsearch for Apache Hadoop:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
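The repository-hdfs plugin is driven through Elasticsearch's ordinary snapshot REST API. As a rough sketch of what registering such a repository looks like (the cluster URL, repository name and HDFS URI below are placeholders I made up, not values from the question):

```python
import json

# Sketch: build the JSON body for registering an HDFS snapshot repository.
# The HDFS URI and path are illustrative assumptions.

def hdfs_repo_body(hdfs_uri, path):
    """Build the body for a PUT /_snapshot/<repo> request."""
    return {
        "type": "hdfs",
        "settings": {
            "uri": hdfs_uri,   # e.g. hdfs://namenode:8020/
            "path": path,      # directory inside HDFS for the snapshots
        },
    }

body = hdfs_repo_body("hdfs://namenode:8020/", "/backups/elasticsearch")

# Against a live cluster you would then send something like:
#   import requests
#   requests.put("http://localhost:9200/_snapshot/hdfs_repo", json=body)
#   requests.put("http://localhost:9200/_snapshot/hdfs_repo/snapshot_1")
print(json.dumps(body, indent=2))
```

Restoring goes through the same API in the other direction (POST to the snapshot's _restore endpoint).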
Related
I am new to Elasticsearch and want to index my website logs, which are stored on HDFS, for fast querying.
I have a well-structured pipeline which runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate Elasticsearch with it, so that it also indexes these logs based on particular field(s), thereby giving faster query results when using Spark SQL.
So, my question is: can I index my data based on particular field(s) only?
Also, my logs are saved in the Avro file format. Does ES provide a way to directly index Avro-serialized data, or do I need to convert it into some other format?
Thank you in advance.
I would suggest you look at the Elasticsearch, Logstash and Kibana (ELK) stack, which should be good enough to fulfill your requirement. Putting the data on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index on whatever fields you wish to query, and build easy dashboards in less than 10 minutes. Take a look at this tutorial for a step-by-step guide:
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html
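To give a feel for the "index only the fields you query" part, here is a minimal sketch in Python of shaping log records for Elasticsearch's _bulk API while dropping everything else. The field names and index name are my own placeholders, not anything from the question:

```python
import json

# Sketch: keep only selected fields per record and build a bulk-index body.
# Field names ("timestamp", "url", "status") and the index name "weblogs"
# are illustrative assumptions.

def select_fields(record, fields):
    """Keep only the fields you want Elasticsearch to index."""
    return {k: record[k] for k in fields if k in record}

def bulk_lines(records, index, fields):
    """Build the newline-delimited body for Elasticsearch's _bulk API."""
    lines = []
    for rec in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(select_fields(rec, fields)))
    return "\n".join(lines) + "\n"

logs = [{"timestamp": "2015-06-01T10:00:00", "url": "/home",
         "status": 200, "raw": "full original log line"}]
body = bulk_lines(logs, "weblogs", ["timestamp", "url", "status"])

# A live cluster would accept this via something like:
#   requests.post("http://localhost:9200/_bulk", data=body,
#                 headers={"Content-Type": "application/x-ndjson"})
```

On the Avro part of the question: Elasticsearch ingests JSON, so to the best of my knowledge Avro-serialized records would have to be deserialized first (with an Avro library for your language) and then reshaped as above; it does not index Avro directly.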
I would like to integrate Apache Hadoop and Elasticsearch for my project, where the data will be stored in HDFS and at the same time will also be available in Elasticsearch.
Any changes I make to the data in Elasticsearch should also be reflected and stored in HDFS. So basically I am looking for a bidirectional flow between Elasticsearch and HDFS. I have tried searching the web for useful documentation but didn't find anything helpful. Any help in this regard would be greatly appreciated.
Thanks
See Elasticsearch for Hadoop and Map Reduce. You can index data from HDFS into ES in real time for new, modified and deleted data. But going from ES back into HDFS may not be possible, because HDFS is based on a write-once-read-many architecture (see HDFS Design).
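For the HDFS-to-ES direction, the es-hadoop connector is driven by a handful of es.* settings. A minimal sketch, assuming a PySpark job with the elasticsearch-hadoop jar on the classpath (the host and the index/type name are placeholders; only the configuration is built here, since the actual write needs a running cluster):

```python
# Sketch: configuration for writing HDFS-derived data into Elasticsearch
# with the es-hadoop connector. "localhost:9200" and "weblogs/line" are
# illustrative assumptions.

def es_write_conf(nodes, resource):
    """Settings understood by the elasticsearch-hadoop connector."""
    return {
        "es.nodes": nodes,        # Elasticsearch host(s)
        "es.resource": resource,  # index/type to write into
    }

conf = es_write_conf("localhost:9200", "weblogs/line")

# With a live Spark + ES setup, a DataFrame could then be saved like:
#   df.write.format("org.elasticsearch.spark.sql") \
#     .options(**conf).save("weblogs/line")
```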
I'm reading about Hadoop with Elasticsearch, and I'm confused about how it works.
I guess in this case Elasticsearch is a substitute for HDFS/HBase, so I could write Hadoop jobs to fetch and process data in Elasticsearch.
Is this correct?
If yes, does this work for Hive and Pig too?
You can use Elasticsearch in Hadoop as:
Input and output for MapReduce
Input (storage) for Hive and Pig
Something you can write to and read from directly with Cascading
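For the input direction in that list, the es-hadoop connector also accepts a query that is pushed down to Elasticsearch, so a MapReduce/Hive/Pig job only sees matching documents. A sketch of the relevant settings (host, index name and query are made-up examples):

```python
# Sketch: configuration for using Elasticsearch as job input via es-hadoop.
# The host, index/type and query values are illustrative assumptions.

def es_read_conf(nodes, resource, query=None):
    conf = {"es.nodes": nodes, "es.resource": resource}
    if query:
        conf["es.query"] = query  # filter is pushed down to Elasticsearch
    return conf

conf = es_read_conf("localhost:9200", "weblogs/line",
                    '{"query": {"match": {"status": 200}}}')

# In a MapReduce job this configuration would back EsInputFormat;
# in PySpark, something like:
#   df = spark.read.format("org.elasticsearch.spark.sql") \
#            .options(**conf).load("weblogs/line")
```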
I need to implement a system which derives analytics/insights from data (Text-only) as well as can do complex search queries.
So I have shortlisted Solr (search) and Hadoop (analytics). I am unable to decide which one I should start from. Can we integrate an HDFS cluster with Solr? I will mainly be dealing with aggregation queries, and the data will not be updated frequently.
I know this question is too broad and general. I just need an expert's opinion on this matter.
Look at Cloudera Search and this
Cloudera Search = Solr + Hadoop
Using Cloudera Search, you can query the data in Hadoop or HBase using Solr.
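Since the question mentions mostly aggregation queries, a hypothetical sketch of the kind of request Cloudera Search would serve: a standard Solr /select faceting query, which counts documents per field value. The collection and field names are assumptions:

```python
# Sketch: parameters for a Solr /select request that returns facet counts
# (document counts per distinct field value). Field name "status" and the
# collection used in the comment below are illustrative assumptions.

def facet_params(q, facet_field, rows=0):
    """Build Solr query parameters for a simple field facet."""
    return {
        "q": q,
        "rows": rows,              # 0: we only want the facet counts
        "facet": "true",
        "facet.field": facet_field,
        "wt": "json",
    }

params = facet_params("*:*", "status")

# Against a live Solr / Cloudera Search instance:
#   import requests
#   r = requests.get("http://localhost:8983/solr/logs/select", params=params)
```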
I am using Lucene to provide indexing and searching on text files. Can I use HDFS for storing the index files?
You have the tasks reversed: instead of thinking about where to use Hadoop, first think about what you need to implement your project. If you then see that you need Hadoop, it will become obvious where and how to use it.
One tip: most probably you need neither Hadoop nor even Lucene itself. Solr, a search server built on top of Lucene, now has a distributed setup specifically designed for indexing and searching; Nutch can be used as a front end for Solr to crawl the web; and Tika can help you parse all kinds of offline files.
Lucene comes into the picture after all your data is ready in the form of Lucene documents.
It looks like you already know Lucene. The purpose of Hadoop is to break a big task into small chunks. I think the first use of Hadoop could be to gather the data: each Hadoop node can keep collecting data and creating Lucene documents.
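A toy sketch of that last idea: the map step of such a job would turn raw lines into field/value documents that a Lucene or Solr indexer could then consume. The field names here are made up for illustration:

```python
# Sketch: the per-node "gather" step as a map function that turns raw
# text lines into simple field/value documents. Field names ("content",
# "length") are illustrative assumptions.

def to_document(line):
    """Map one raw text line to a document with a couple of fields."""
    words = line.split()
    return {
        "content": line,
        "length": len(words),
    }

docs = [to_document(l) for l in ["hello world", "lucene on hadoop"]]
```

Each node would run this over its own chunk of the input, and the resulting documents would be fed to the indexer.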